Show HN: Paramount – Human Evals of AI Customer Support (github.com/ask-fini)
71 points by hakimk 7 months ago | hide | past | favorite | 44 comments
Hey HN, Hakim here from Fini (YC S22), a startup focused on providing automated customer support bots for enterprises that have a high volume of support requests.

Today, one of the largest use cases of LLMs is automating support. As the space has evolved over the past year, a need has emerged for evaluations of LLM outputs - and a sea of LLM Evals packages have been released. "LLM evals" refer to the evaluation of large language models, assessing how well these AI systems understand and generate human-like text. These packages have recently relied on "automatic evals," where algorithms (usually another LLM) automatically test and score AI responses without human intervention.

In our day-to-day work, we have found that Automatic Evals are not enough to reach the 95% accuracy our Enterprise customers require. Automatic Evals are efficient, but they often miss nuances that only human expertise can catch. Automatic Evals can never replace the feedback of a trained human who is deeply knowledgeable about an organization's latest product releases, knowledgebase, policies and support issues. The key to solving this is to stop ignoring the business side of the problem and start involving knowledgeable experts in the evaluation process.

That is why we are releasing Paramount - an Open Source package which incorporates human feedback directly into the evaluation process. By simplifying the step of gathering feedback, ML Engineers can pinpoint and fix accuracy issues (prompts, knowledgebase issues) much faster. Paramount provides a framework for recording LLM function outputs (ground truth data) and facilitates human agent evaluations through a simple UI, reducing the time to identify and correct errors.

Developers can integrate Paramount with a Python decorator that logs LLM interactions into a database, followed by a straightforward UI for expert review. This process aids the debugging and validation phase of launching accurate support bots. We'd love to hear what you think!
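To make the decorator idea concrete, here is a minimal sketch of the pattern described above. The names (`record`, the table schema, the `answer_question` stub) are illustrative, not Paramount's actual API:

```python
# Sketch of a decorator that logs LLM function calls as ground-truth data.
# All names here are hypothetical; consult Paramount's README for the real API.
import functools
import json
import sqlite3
import time

DB = sqlite3.connect(":memory:")  # a real setup would use a persistent database
DB.execute(
    "CREATE TABLE IF NOT EXISTS recordings "
    "(ts REAL, func TEXT, args TEXT, output TEXT, accepted INTEGER)"
)

def record(func):
    """Log every call's inputs and output as an unreviewed recording."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        output = func(*args, **kwargs)
        DB.execute(
            "INSERT INTO recordings VALUES (?, ?, ?, ?, NULL)",  # NULL = not yet reviewed
            (time.time(), func.__name__, json.dumps([args, kwargs]), json.dumps(output)),
        )
        DB.commit()
        return output
    return wrapper

@record
def answer_question(question: str) -> str:
    # In a real bot this would call an LLM; stubbed here for the example.
    return f"Echo: {question}"
```

A human expert would then review the `recordings` rows in a UI and flip `accepted` to mark each response good or bad.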




“Customer support by AI” sounds even worse than the current practice of having customer support provided by utterly clueless and unhelpful “support” from barely-English-speaking-people from the lowest-wage-country-we-could find (stereotypically India).

How did you come to hate people so deeply that you made it your life goal to make an already abysmal experience even worse?


I was on a small team of 10 that experienced very strong growth in our product. One of the consequences of success is that we had to take hundreds of calls per day. We eventually had to hire dedicated customer support people, but after-hours coverage was an issue.

There are large outsourced customer service companies but those cost $1 a minute and those people have to deal with CS calls from many different companies, so it took them forever to find scripts relevant to the customer’s request. Most calls failed - that outsourced CS was a glorified voicemail box.

Our success rate with outsourced CS was very low, partly because we didn’t have the resources to fly over to Austin to train a workforce that didn’t prove to be particularly promising.

AI voice bots would have been able to helpfully answer and deflect 90% of our calls, and do so in a much faster and humane way.

I understand the sentiment behind your comment, but I cannot agree with your assessment of the outcome.


> AI voice bots would have been able to helpfully answer and deflect 90% of our calls, and do so in a much faster and humane way.

Would they have? I've worked at companies that had AI voice bots, and they required a ton of care and feeding, like a script. The only reason we used them is that they were better at following the script and didn't cause massive customer backlash compared to the backlash we got when support was offshored.


So many of our calls were informational with a predictable/scriptable answer set. 90% may be high, but I know 80% of the calls would have been squashed by an AI bot.

There's definitely a ton of care that you have to put into creating it aside from just retrieval, like control flow, but the state of our outsourced CS was really, really bad. 5 minute wait times just to get connected to someone that mispronounced our company name and mostly told people to call us in the morning. The costs were absurd for the quality of service we got and they ended up making customers angry.


> 90% may be high, but I know 80% of the calls would have been squashed by an AI bot.

Which one is it 90% or 80%? Is it possible it could've been 70%? 60%? 20%?


I understand where you're coming from - poor support experiences are the bane of my existence.

Preventing poor support experiences is the exact reason we started Fini. With a long career as a software engineer in the support industry, I have seen first hand when and why these initiatives fail.

Unfortunately, when growth-stage companies experience a surge of signups, they face a choice between either 1) delaying responses to all support issues by weeks due to lack of staffing, or 2) trying to automate/offshore support, leading to poor coverage and accuracy.

Both of these options usually lead to horrendous support experiences such as the ones you mention.

Thankfully, with recent advances such as LLMs, we are one step closer to bridging this gap. In our most successful projects we are able to reach 95% accuracy specifically thanks to keeping a close eye on human evaluations.

That's why we are launching this package as well, which we believe brings us one step closer towards our mission of helping millions of end-users receive the high quality support they deserve within seconds rather than weeks.

It's a challenge which very few (if any!) companies have pulled off well, which is why we have decided to put all our effort into making scalable and hassle-free support our first priority.


> In our most successful projects we are able to reach 95% accuracy

Is this the 95% of issues that are common between customers that could be and are handled at scale with a FAQ/knowledgebase that is surfaced to a user automatically through prompts?

How does it actually do on the smaller percentage of questions that are complex?

And that 95% is on the most successful projects. But what does the distribution of accuracy look like across all projects? If 2 projects out of 1000 are 95% accuracy that doesn't mean much to me.


FAQ/knowledgebase are not enough to reach 95%. If that was the case then this problem would've been solved long ago with the advent of semantic search engines.

You also need the following:

1) General principles of instructions split across topics - essentially distilling a support training manual into a flow chart of prompts.

2) Algorithms to detect best past resolutions to create new knowledge-base material on the go.

3) Continuous retraining and feedback loops, which we have written models to perform on behalf of customers.

And continuous R&D effort into how these systems can improve.
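Point 1 above - a support training manual distilled into a flow chart of prompts - can be sketched as a simple topic router. Everything here (topics, keywords, prompts) is made up for illustration, not Fini's implementation:

```python
# Illustrative topic router: classify the question into a topic, then use that
# topic's specialised instructions as the system prompt for the LLM call.
TOPIC_PROMPTS = {
    "billing": "You handle billing. Never promise refunds; escalate disputes.",
    "account": "You handle account access. Verify identity before any change.",
    "general": "Answer only from the knowledge base; escalate if unsure.",
}

TOPIC_KEYWORDS = {
    "billing": ("invoice", "charge", "refund", "payment"),
    "account": ("password", "login", "2fa", "locked"),
}

def route(question: str) -> str:
    """Pick the first topic whose keywords appear in the question, else 'general'."""
    q = question.lower()
    for topic, keywords in TOPIC_KEYWORDS.items():
        if any(k in q for k in keywords):
            return topic
    return "general"

def system_prompt(question: str) -> str:
    return TOPIC_PROMPTS[route(question)]
```

A production router would likely use an LLM or embedding classifier rather than keywords, but the structure - route first, then apply topic-specific instructions - is the same.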


It seems like a very convoluted path, but if people are for some strange reason more apt to end up at the FAQ via “chatting” with a tool like this rather than just looking it up directly, this still seems useful, right? I mean, what’s the alternative? Is there a “more self-directed users as a service” offering out there?


You may have written this yourself, but it totally sounds like a ChatGPT answer. If you stare too long into the abyss, the abyss stares back into you.


Change my view:

If a company has successfully automated a support action (e.g. “help me link 2 accounts”) then it’s a part of the application itself. “Automating customer support” is just a cop out. It’s saying, “we’d love to have this be part of our application, so we don’t have to throw human support at the problem, but we’re not actually sure our automation works”.


Taking a step back though, where has hyper-growth gotten us as an industry and as a society? What if, instead, start-ups throttled their growth to keep a sustainable level of support so that the growth fuels the support instead of just fueling more growth?


The whole startup industry is a bit silly but that isn’t really actionable from any one person’s point of view.


Not to disprove your point (cuz I know how rare this is), but anecdotally: I currently work for a small, bootstrapped company of about 12 people. We never took VC money, was founded by two partners, and just slowly grew by organic word-of-mouth.

I'm one of just a handful of customer support people there, but we try really hard to provide good help to our customers across emails, forums, and live calls in several time zones. We're a headless CMS company, so the questions tend to be a mix of stuff like "how do I do this with your API" and "my Next.js site isn't working, fix it" (even though it's not our code) and "halp, the internet is down" (occasionally it's our infra, but more often it's just a regional outage).

We serve a variety of customer types, from experienced devs to marketers who've never coded before. We try our best to help with all of them, sometimes providing in-depth troubleshooting and mini code reviews for their frontends, all at no additional charge. We're a tiny company in a small niche to begin with, but our customers tend to like us a lot and stick around for a long time. We'll never hyper-scale, but our customers, founders, and employees are happy. None of us will ever become the next tech billionaire, but we have good work-life balances and families and hobbies outside of our jobs. It's sustainable as long as our core product remains relevant. I wish more companies were like that!

In my experience, this sort of setup is more common with smaller companies who deliberately want to maintain an intimate relationship with their customers (and employees, for that matter). As companies get bigger, the human element tends to get lost in the anonymous seas of profit and loss statements. I know it's rare these days, but I really wish we had more small-biz tech companies of a dozen or two people rather than the global (and soon, interplanetary?) hyper-scalers.

I think biologically we just function best at the scale of small villages, where we can actually build rapport and trust over time. Whether it's customer support, a vendor-customer relationship, or even an organic community like a subreddit (or HN), people work together better when they feel like they're part of the same in-group (or at least a federation of allies working towards aligned goals). At a certain scale, personal relationships get replaced by impersonal bureaucracies instead, and then trust and accountability start to crumble. It takes deliberate effort to stay under that scale when the allure of hyperwealth is so strong, but it IS possible.

The last few companies I've enjoyed working at were also smaller outfits, small enough where I could personally answer all web inquiries from our customers. It wasn't technically part of my job description, but I'd go out of my way to try to build trust – a rarer and rarer commodity on the Web these days. If someone complained about a bug on our website, I'd take the responsibility to apologize, find and fix it, and let them know what happened. And at the bigger companies where I wasn't able to have that sort of direct connection with our customers, I'd still work with the customer support team to make sure they at least had a direct line of communication/escalation to me and the other devs. I'd also take time out of my days to learn from the support team to better understand their jobs, how they do it, what their pain points were internally, and what our common customer complaints were. Again, it's rare, but it IS possible, and far easier/more common in the few remaining small companies who actually still care about customer experience and that human touch.

The TLDR is I'd highly encourage people in our field to work for smaller/mid-sized businesses wherever possible :) There is almost always a wage tradeoff, but as long as it's a livable wage, it's totally worth it for the sanity, autonomy, and happiness you get in exchange. And as a customer, I'd far prefer to do business with small companies who have good support vs faceless megacorps.


When the obvious desire of 90% of customer support is to make the customer go away, being able to use a box rather than a person to get rid of them seems like a perfect solution.

Support is a cost centre, and the people calling support are not the ones who will be making purchasing decisions, so why provide anything close to decent support? Vendor lock-in, subscription services and other ways to reduce the chance that the customer will go elsewhere all contribute to the downward spiral in support.

Truth is, if they can manage to give the AI proper feedback on when the support it provides is actually useful or successful, it may actually learn to be better than employing someone to read off a support flowchart that hasn't been updated in 20 years.


I get that naming is hard, but what’s the rationale behind the name, given that there is a large and well-funded brand with the same one?


Many companies use this style of "support" in place of employing real people and don't provide a method to escalate to a real human. What guard do I have, as a consumer, against trash like that? What makes yours unique?

When I call for support I expect to receive support. I have never ever had an AI or robot support assistant that actually provides the help that I need when I call.

How do you address the possibility of LLM generating absolutely false information?

Do you actively tell the user where they can contact a real human to provide feedback about the terrible customer support response that your product will undoubtedly provide? Or is this just another dumpster fire of "it sounds like you're having trouble with our product. here's our FAQ..."?

Does your product integrate with internal company APIs? How do you deal with the risk of customers abusing that?


We have a number of safeguards:

1) Our customers can decide which categories of end-user questions should be auto-escalated

2) We ensure that if a question cannot be answered based directly on knowledgebase content, we either escalate or try to ask followups to drill deeper

3) We offer an API which our customers can directly integrate with their own systems. For protection, we either limit the set of allowed actions, or if the customer prefers, they can handle the safeguarding code internally with advice from us.

Lastly I would say the usage of LLMs is young in the support space and things will only improve from here!


Most of this AI stuff is sorely lacking a "feedback loop." If you aren't collecting data on angry/unhelped customers, you are just flushing your business down the drain.

Also, maybe it is just me, but I would never call customer service unless I have already exhausted all self-help options, meaning I need a real human who can poke into the database or whatever to fix a real edge-case problem. Who did the calculations to decide these power users (best customers?) can all just die?


How much do you plan to pay for the product? If you're paying a few dollars per month for a consumer product, be aware that having a human answer a single inbound call or email from you can often wipe out the entire year's revenue (not just profit) the company is getting from you.

Either we all need to be willing to pay more in exchange for "premier" support being available or we need to be ok with companies trying to cut down the support load.


> Either we all need to be willing to pay more in exchange for "premier" support being available or we need to be ok with companies trying to cut down the support load.

Third option: if you can't afford to provide human support then you don't deserve to be a business.

Too expensive to provide support? Then raise your prices. If your customers leave then your product isn't viable.


Ok, but you’re still making the same point. You can’t have something cheap with a white-glove CS experience for every caller. Either you have to have a screen to filter out the complete idiots who will waste your time all day because their mouse isn’t plugged in (like AI, or some minimum-wage tier 1, etc.), or it has to cost more to pay for that CS.


> Ok but you’re still making the same point. You can’t have something cheap with a white glove CS experience for every caller.

That is my point.

The "cheap with a white glove CS experience" is a completely shit experience. If you can't afford to pay for real customer support then either raise your prices or go bankrupt.

Building a business without providing customer support should not ever be a valid business model.


Hey team, nice work. Can you help me understand this better? How does the process work in terms of the human agent evaluations? Is it real time, so that the right (maybe a better word is best) answers go to users as they are needed, or is it done asynchronously/batch style, so that the humans are training models to be better? Once the best answers are selected, are they fed back into an LLM / AI agent model? Thanks


Following up on my own question. I re-read the GitHub README, so I can clarify my question: the AI agent responses are saved to a database where the human SME can classify responses as good/bad, right? Do you intend for the result of this analysis to retrain the AI agent, or is it purely to get a baseline on the as-is AI agent quality?


Indeed, the evaluations are saved to a DB. Right now it's possible to use this for regression testing with the help of the Optimize tab. There you can experiment with changing an input parameter (such as prompt, temperature, etc.), rerun the recordings, and check whether the LLM responses still match all previously accepted recordings - a similarity score tells you whether your change introduced regressions.
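The regression check described above can be sketched roughly as follows. The similarity metric here (stdlib `SequenceMatcher`) is a stand-in; Paramount may well use embeddings or another measure, and the function and field names are hypothetical:

```python
# Rough sketch of rerunning accepted recordings after a parameter change
# and flagging any whose new output drifts too far from the accepted one.
from difflib import SequenceMatcher

def similarity(accepted: str, candidate: str) -> float:
    """Lexical similarity between 0.0 and 1.0 (a stand-in for a real metric)."""
    return SequenceMatcher(None, accepted, candidate).ratio()

def check_regressions(recordings, rerun, threshold=0.8):
    """rerun(input) produces the new output; return recordings below threshold."""
    failures = []
    for rec in recordings:
        new_output = rerun(rec["input"])
        score = similarity(rec["accepted_output"], new_output)
        if score < threshold:
            failures.append((rec["input"], score))
    return failures
```

For example, after changing the prompt, `rerun` would call the LLM with the new prompt and each recording's original input; an empty `failures` list means no regressions against the human-accepted ground truth.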

In the future we are planning to enable a retraining pipeline - most likely we will do this in our core offering at usefini.com


Thank you for explaining


Hi smarri, I hope you're well.

Sorry, this might not be the best place to ask, but a few months ago you posted on my thread about a part-time, remote Master's degree in Software Development in the UK from which businesses recruit directly. My apologies for the late reply, but could I get in touch with you for more details?


How does Fini annual pricing at $0.076 per question work??


We typically offer a package price for companies with >1k questions per month with white-glove onboarding and numerous custom optimizations to ensure higher performance.

For the fully self-serve, pay-as-you-go option, we are able to offer lower pricing.


I’m asking how do you blend annual billing and usage based together?


Jesus H. Fucking Christ.

I am infuriated by the very idea. Whenever I have to interact with any of this sort of nonsense I always aggressively hit zero and demand to speak to a HUMAN not a machine while cursing wildly.

You are actively making the world a worse place with this “product”.

Don’t think about if you can. Think about if you should.


Slightly off-topic, but slightly not.

I feel like customer support is one of the worst experiences in all of tech or maybe just modern capitalism. Obviously slashing budgets is a huge part of it, but I think the problem is exacerbated by lying with metrics.

One of the great challenges of trusting ANYBODY (even internal teams) with support, is that they have a huge incentive to lie to their management chain about the numbers in every way they can. And their manager or even multiple levels up may turn a blind-eye to these lies if it makes their life easier.

I don't know anything about this product, but I would never trust a company with anything of importance if I don't have a complete audit log of their requests and how they handle them which they send to us for review each month (even a 1%-.1% sample of calls will give a very clear picture into how the process is going).


I am reminded of DHH's recent essay: "It’s easier to forgive a human than a robot"

https://world.hey.com/dhh/it-s-easier-to-forgive-a-human-tha...


Your license is very weird: https://github.com/ask-fini/paramount/blob/main/LICENSE

    This file is part of paramount project, licensed under the GNU General
    Public License (GPL) for companies with fewer than 100 employees or
    fewer than 1000 invocations/month. For larger companies or higher
    volume, a commercial license is required. For more information,
    contact hello@usefini.com.
I'm fine with companies not using open source licenses, but this is a very odd way to do it. Licensing something under the GPL doesn't work like this.

You should look at one of the existing non-open-source licenses like the Business Source License or https://fsl.software/ rather than modifying the GPL by adding an extra paragraph at the top: https://github.com/ask-fini/paramount/commit/8345edd8f776572...

Also, with a license like this it's not accurate to say "Paramount - an Open Source package..." - that's a misuse of the term.


> Licensing something under the GPL doesn't work like this.

Sure it does! The GPL covers this exact scenario.

Section 7 enumerates the additional restrictions you may include alongside the license which will apply to any further distribution. Those are mostly around indemnification, trademarks, etc.

It explicitly says all other non-permissive additions are considered further restrictions and if the program says it is covered by GPL you may remove those terms.

(There’s also section 10, but we don’t need it.)

Since the README says this is “under GPL license for individuals”, and the GPL license says I can remove those terms… without even getting really far into the mud here, I can download a copy of the software, strip those restrictions, and repost it under the GPL sans restrictions for anyone to use.

That all said… it will probably have most of the intended effect. Individuals won’t care about the license much (may limit outside contributors), but no company is going to touch this with a ten foot pole with a hacked up GPL on it, >100 employees or otherwise.


Except Section 7 has this embedded in it:

> When you convey a copy of a covered work, you may at your option remove any additional permissions from that copy, or from any part of it. (Additional permissions may be written to require their own removal in certain cases when you modify the work.) You may place additional permissions on material, added by you to a covered work, for which you have or can give appropriate copyright permission.

So I've forked it - I have fewer than 100 employees or 1000 invocations - and ripped their clause out. Go forth and use as much as you want!

The GPL has a ton of clauses that pretty much prevent these shenanigans. That explains why I can't recall any enterprise-backed software like this under the GPL.


Yes, I believe that was the whole point of my comment. The third paragraph covers this.

The second-to-last paragraph makes the same point as your second-to-last.

So no "except" - we're in agreement here.


Thanks for the feedback everyone, we have now removed this paragraph and the whole package is now under GPL regardless of usage!

The idea was to get in touch with enterprises looking to make heavy use of the project - but you're right that this may have unintended effects.


That's a great response - thanks very much for taking this feedback on board.


Great work! Thank you!


I feel like doing OSS-like licenses has turned into a way to nerd-snipe people into adding comments to your submission.


> Also, with a license like this it's not accurate to say "Paramount - an Open Source package..." - that's a misuse of the term.

It’s not free and open source software (FOSS), that’s for sure. The GPL can’t be used like this; were it so simple, plenty of others like Redis or Elasticsearch would have done so. This license is worse than no license.


Maybe work with a specialist attorney on the license if you haven't already?



