Launch HN: Parity (YC S24) – AI for on-call engineers working with Kubernetes
83 points by wilson090 47 days ago | 75 comments
Hey HN — we’re Jeffrey, Coleman, and Wilson, and we’re building Parity (https://tryparity.com), an AI SRE copilot for on-call engineers working with Kubernetes. Before you've opened your laptop, Parity has conducted an investigation to triage, determine root cause, and suggest a remediation for an issue. You can check out a quick demo of Parity here: https://tryparity.com/demo

We met working together as engineers at Crusoe, a cloud provider, and we always dreaded being on-call. It meant a week of putting our lives and projects on hold to be prepared to firefight an issue at any hour of the day. We experienced sleepless nights after being woken up by a PagerDuty alert to then find and follow a runbook. We canceled plans to make time to sift through dashboards and logs in search of the root cause of downtime in our k8s cluster.

After speaking with other devs and SREs, we realized we weren’t alone. While every team wants better monitoring systems or a more resilient design, the reality is that time and resources are often too limited to make these investments.

We’re building Parity to solve this problem. We’re enabling engineers working with Kubernetes to more easily handle their on-call by using AI agents to execute runbooks and conduct root cause analysis. We knew LLMs could help given their ability to quickly process and interpret large amounts of data. But we’ve found that LLMs alone aren’t sufficiently capable, so we’ve built agents to take on more complex tasks like root cause analysis. By allowing on-call engineers to handle these tasks more easily and eventually freeing them from such responsibilities, we create more time for them to focus on complex and valuable engineering investments.

We built an agent to investigate issues in Kubernetes by following the same steps a human would: developing a possible root cause, validating it with logs and metrics, and iterating until a well-supported root cause is found. Given a symptom like “we’re seeing elevated 503 errors”, our agent develops hypotheses as to why this may be the case, such as nginx being misconfigured or application pods being under-resourced. Then, it gathers the necessary information from the cluster to either support or rule out those hypotheses. These results are presented to the engineer as a report with a summary and each hypothesis. It includes all the evidence the agent considered when coming to a conclusion so that an engineer can quickly review and validate the results. With the results of the investigation, an on-call engineer can focus on implementing a fix.
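To make that loop concrete, here is a highly simplified sketch of the shape of the investigation process (illustrative Python only; the LLM calls and read-only kubectl queries are abstracted away as functions you pass in, and the real loop also refines hypotheses and iterates — this is not our production code):

  from dataclasses import dataclass, field

  @dataclass
  class Finding:
      hypothesis: str
      supported: bool
      evidence: list = field(default_factory=list)

  def investigate(symptom, propose, collect, judge):
      # propose(symptom)       -> candidate root causes (an LLM call in practice)
      # collect(hypothesis)    -> read-only evidence: logs, events, metrics, configs
      # judge(hypothesis, ev)  -> True if the evidence supports the hypothesis
      findings = []
      for hypothesis in propose(symptom):   # e.g. "nginx misconfigured", "pods under-resourced"
          evidence = collect(hypothesis)
          findings.append(Finding(hypothesis, judge(hypothesis, evidence), evidence))
      # Every hypothesis and its evidence is kept so an engineer can audit the conclusion.
      return findings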

We’ve built an additional agent to automatically execute runbooks when an alert is triggered. It follows steps of a runbook more rigorously than an LLM alone and with more flexibility than workflow automation tools like Temporal. This agent is a combination of separate LLM agents each responsible for a single step of the runbook. Each runbook step agent will execute arbitrary instructions like “look for nginx logs that could explain the 503 error”. A separate LLM will evaluate the results, ensuring the step agent followed the instructions, and determines which subsequent step of the runbook to execute. This allows us to execute runbooks with cycles, retries, and complex branching conditions.
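And a similarly simplified sketch of the runbook executor, where each step is handled by its own agent and a separate evaluator decides what runs next (again purely illustrative, not our actual code):

  def run_runbook(runbook, run_step, evaluate):
      # runbook:  dict of step name -> instruction, e.g.
      #           {"check_logs": "look for nginx logs that could explain the 503 error"}
      # run_step: LLM agent that carries out a single instruction and returns its result
      # evaluate: separate LLM that checks the result and picks the next step name,
      #           or None to stop; retries, cycles, and branching all fall out of this
      step, history = next(iter(runbook)), []
      while step is not None:
          result = run_step(runbook[step])
          history.append((step, result))
          step = evaluate(runbook[step], result, history)
      return history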

With these tools, we aim to handle the “what’s going wrong” part of on-call for engineers. We still believe it makes the most sense to continue to trust engineers with actually resolving issues as this requires potentially dangerous or irreversible commands. For that reason, our agents exclusively execute read-only commands.

If this sounds like it could be useful for you, we’d love for you to give the product a try! Our service can be installed in your cluster via a helm repo in just a couple of minutes. For our HN launch, we’ve removed the billing requirement for new accounts, so you can test it out on your cluster for free.

We’d love to hear your feedback in the comments!




It would be kind of interesting if, based on an engineer accepting the suggestion, parity generated a new run book.

This would allow repeated issues to be well documented.

On iOS Firefox, when clicking “pricing” on the menu, it scrolls to the proper location, but does not close the menu. Closing the menu causes it to jump to the top of the page. Super annoying.


Agreed, this feature is on our todo list. Another big problem we're aiming to tackle is the tribal knowledge that builds up on teams in part due to a lack of documentation. We want to make it easy to build new runbooks and keep existing runbooks up to date

And thanks for the bug report, I'll take a look


If an issue can be automatically detected and remediated, do you really need a runbook? That space has to be huge. I don't see a purpose for documenting it.

That said, a tool that runs through existing runbooks and improves them or suggests new ones would be extremely useful IMHO.


> I don't see a purpose for documenting it.

Because when it goes wrong you will want to know what it did. When you discover something new, you are going to want to be able to change the runbook. New employees are going to want to learn how things work from the runbook.

Why WOULDN'T you want to document what it is doing? I would never trust an AI that didn't tell me what it was doing and why.


> I don't see a purpose for documenting it.

Enterprises implement stringent Change Management procedures.

If you are making any change to a Prod environment it needs to be thoroughly documented.


Improving documentation.

Keep in mind, they are suggestions. It sounds like the product will automatically execute runbooks but hold suggestions for engineer input. This would move it from “suggestion” to “automatically do X”

Also, sometimes LLMs are wrong.


The product will automatically execute runbooks for you. So far we've focused on using runbooks customers already have, since they know they work for them. We've also added the ability to turn off automatic execution for cases like a suggested runbook, so the customer can make any edits if necessary before approving it to be executed automatically.

Yea, this is a big challenge for us. We're using a variety of strategies to make sure hallucinations are rare, but that's why we're also committed to not executing actions that modify your cluster unless explicitly specified in a runbook


Azure Kubernetes Wrangler (SRE) here, before I turn some LLM loose on my cluster, I need to know what it supports, how it supports it, and how I can integrate it into my workflow.

Videos show a CrashLoopBackOff pod and analyzing logs. This works if the Pod is writing to stdout, but I've got some stuff going straight to ElasticSearch. Does the LLM speak ElasticSearch? How about Log Files in the Pod? (Don't get me started on that nightmare)

You also show fixing things by editing YAML in place. That's great, except my FluxCD is going to revert it since you violated the principle of "everything goes through GitOps". So if you are going to change anything, you need to update the proper git repo. Also, said GitOps setup uses Kustomize, so I hope you understand all the interactions there.

Personally, the stuff that takes most troubleshooting time is Kubernetes infrastructure. Network CNI is acting up. Ingress Controller is missing proper path based routing. NetworkPolicy says No to Pod talking to PostGres Server. CertManager is on strike and certificate has expired. If LLM is quick at identifying those, it has some uses but selling me on "Dev made mistake with Pod Config" is likely not to move the needle because I'm really quick at identifying that.

Maybe I'm not the target market and target market is "Small Dev team that bought Kubernetes without realizing what they were signing up for"


Your comment brings up a good point (and also one of our big challenges): there is a huge diversity in the tools teams use to set up and operate their infra. Right now our platform only speaks to your cluster directly through kubectl commands. We’ll build other integrations so it can communicate with things like ElasticSearch to broaden its context as needed, but we’ll have to be somewhat thoughtful in picking the highest-ROI integrations to build.

Currently, we only handle the investigation piece and suggest a remediation to the on-call engineer. But to properly move into automatically applying a fix, which we hope to do at some point, we’ll need to integrate into CI/CD

As for the demo example, I agree that the issue itself isn’t the most compelling. We used it as an example since it is easy to visualize and set up for a demo. The agent is capable of investigating more complex issues we've seen in our customers' production clusters, but we're still looking for a way to better simulate these on our test environment, so if you/anyone has ideas we’d love to hear them.

We do think this has more value for engineers/teams with less expertise in k8s, but we think SREs will still find it useful


>we're still looking for a way to better simulate these on our test environment, so if you/anyone has ideas we’d love to hear them.

Pick a Kubernetes offering from the big 3, deploy it, then blow it up.

(I couldn't get HackerNews to format this properly and I'm done fighting it)

On Azure, deploy a Kubernetes cluster with following:

Azure CNI with Network Policies

Application Gateway for Containers

External DNS hooked to Azure DNS

Ingress Nginx

Flexible PostGres Server (outside the cluster)

FluxCD/Argo

Something with using Workload Identity

Once all that is configured, put some fake workloads on it and start misconfiguring it with your LLM wired up. When the fireworks start, identify the failures and train your LLM properly.


> we think SREs will still find it useful

There are two kinds of outages: people being idiots and legit hard-to-track-down bugs. SREs worth their salt don't need help with the former. They may find an AI bot somewhat useful to find root cause quicker, but usually not so valuable as to justify paying the kind of price you would need to charge to make your business viable to VCs. As for the latter, good luck collecting enough training data.

Otherwise, you're selling a self-driving car to executives who want the chauffeur without the salary. Sounds like a great idea, until you think about the tail cases. Then you wish you had a chauffeur (or picked up driving skills yourself).

Maybe you'll find a market, but as an SRE, I wouldn't want to sell it.


I basically want to +1 this. OP isn't selling to any place that is already spending six figures on SRE salaries. Actual competitors are companies like Komodor and Robusta who sell "we know Kubernetes better than you" solutions to companies that don't want to spend money on SRE salaries. Companies in this situation should just seriously reconsider hosting on Kubernetes and go back to higher-level managed services like ECS/Fargate, Fly/Railway, etc.


> CertManager is on strike and certificate has expired

Had a good chuckle here, hah.


Same. I typically call it “hung”, but maybe stating certmanager is on strike will get the point across better.

But, sigh, it really does speak to the state of the kubernetes ecosystem. All these projects need to be operated, you can’t just set it and forget it.


I'm sure this is on their roadmap, but honestly a prerequisite should be a separate piece of software that analyzes and suggests changes to your error handling.

This is a cool proof of concept but almost useless otherwise in a production system

I can already feed Claude or ChatGPT my kubectl output pretty easily

Error handling and logging that are tailored for consumption by a specific pre-trained model, that's where this will be groundbreaking.


The AI needs to be integrated into the dev IDE. All my logging is screaming about terrible decisions made by long-ago devs, but getting them fixed now is impossible because they don't want to do it and no one is going to make them.


That is something we're working on -- good observability is a place where teams usually fall short, and it's often the limiting factor for better incident response. We're working on logging integrations as a first step.


An AI agent to triage the production issues caused by code generated by some other startup's generative AI bot. I fucking love tech in 2024.


In a gold rush, sell shovel handles, gloves to hold the shovel, newsletter courses on how to dig...


You're forgetting the AI tech which helps test the AI tech


Just some feedback on the landing page: ditch the Stanford/MIT/Carnegie Mellon logos. I’m not hating on elite universities or anything, but it has no relevance here (this is not a research project) and I think it detracts from the brand. I don’t associate academia with pager-carrying operators of critical services.


Hard disagree. A solid chunk of the value in attending prestigious universities is being able to drop names and logos in this exact case.

Like it or not, touting that your team has a pedigreed educational background is something you put front and center.


I think it depends on the project. There is nothing about going to MIT that makes me think you know anything about k8s, openai's API, or running a startup. I would prefer to see what actual work experience the team has (founded company x, veteran PM at FAANG, etc). If they have no experience I wanna see existing customers. If they have none of that I would focus on the product, as this is like a teenager listing their chores as work experience.


> There is nothing about going to MIT that makes me think

That makes you think (no shade)... Question is: does it bother you enough to not proceed in engaging? Likely not. The upside is other audiences that love that kind of stuff - investors, other customers, individuals, etc.

Mostly neutral to positive, little negative.


As noted at the end of my comment, I do think it's negative. If your achievements peaked in undergrad that looks bad. Investors do care about this stuff, but that should be in your pitch to them, on your resume/LinkedIn, etc. Your site should be for targeting customers.


I think this kind of tooling is one positive aspect of integrating LLM tech in certain workflows/pipelines. Tools like k8sgpt are similar in purpose and show a strong potential to be useful. Look forward to seeing how this progresses.


Thanks! k8sgpt is great, it was one of our inspirations


hmmm idk how I would feel about giving an llm cluster access from a security pov


Valid concern, security and safety are essential for anything that can access a production system. We use k8s RBAC to ensure that the access is read-only, so even if the LLM hallucinates and tries to destroy something, it can't
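If you want to sanity-check that on your own cluster, the official Kubernetes Python client can ask the API server directly whether the agent's credentials are allowed to do anything destructive (illustrative snippet, assuming the agent's kubeconfig/service account is the one loaded):

  from kubernetes import client, config

  config.load_kube_config()   # or load_incluster_config() when running inside the cluster
  auth = client.AuthorizationV1Api()
  review = client.V1SelfSubjectAccessReview(
      spec=client.V1SelfSubjectAccessReviewSpec(
          resource_attributes=client.V1ResourceAttributes(
              verb="delete", resource="pods", namespace="default")))
  resp = auth.create_self_subject_access_review(review)
  print("can delete pods:", resp.status.allowed)   # False with a get/list/watch-only Role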

As we will eventually move towards write-access, we're closely following the work in LLM safety. There has been some interesting work to use smaller models to evaluate tool calls/completions against a set of criteria to ensure safety


The other problem is that you become an extremely big target for bad actors, as you have read/write (or even just read) access to all these k8s clusters. Obviously you can mitigate against that to a fairly high degree with on-prem, but for users not on that...

Cool idea though!


> using AI agents to execute runbooks

This scares me. If I was confident enough in the runbook steps, they'd be automated already by a program. If it's a runbook and not a program, either it's really new or has some subtle nuance around it. "AI" is cool, and humans aren't perfect, but in this scenario I'd still prefer the judgment of a skilled operator who knows the business.

> our agents exclusively execute read-only commands

How is this enforced?

The RCA is the better feature of this tool, in my opinion.


This is a great idea. I use claude for my most of my unknown K8s bugs and it's impressive how useful it is (far more than my coding bugs).


Thanks! We've also been impressed with the performance of out-of-the-box LLMs on this use case. I think in part it is because k8s is a significantly more constrained problem-space than coding, and because of that we'll get to a much more complete solution with the existing state of LLMs than we would for a product like a general software engineer agent.


Well, the website seems to be down so I can't actually see any information about what LLM you are using, but I seriously hope you are not just sending the data to the OpenAI API or something like that, and are instead forcing the use of a private (ideally self-hosted) service.

I would not want any data about my infrastructure sent to a public LLM, regardless of how sanitized things are.

Otherwise, on paper it seems cool. But I worry about getting complacent with this tech. It is going to fail, that is just the reality. We know LLMs will hallucinate and there is not much we can do about it; it is the nature of the tech.

So it might work most of the time, but then when it doesn't, you're bashing your head against the wall trying to figure out what is broken. The system is telling you that all of these things are fine, but one of them actually isn't. And it worked enough times that you trust it, so you don't bother double checking.

That is before we even talk about having this thing running code for automatic remediation, which I hope no one seriously considers ever doing that.


Hmm we're not seeing any issues with the website on our end -- tryparity.com is down for you?

The data security point with LLMs is definitely relevant. There's a broader conversation ongoing right now about how teams will securely use LLMs, but from our conversations so far teams have been willing to adopt the tech. We've been working with startups up to this point, so we'll likely need to offer support for self-hosted LLMs if we move to enterprise, or bring-your-own-keys for larger startups.

The hallucination point is interesting. I think a lot of products will need to solve this problem of building enough trust with the user that they'll blindly follow the outputs, while still occasionally failing due to hallucination. Our approach has been to 1) only focus on investigation/root cause and 2) make sure it's easy to audit the results by sharing all of the results + supporting evidence


Why would you have your demo video set to "unlisted"? (on what appears to be your official channel) I'd think you'd want to show up in as many places as possible


Congratulations on the launch! I'm curious—how is what you're building different from other AI SRE solutions out there, like Cleric, Onegrep, Resolve, Beeps, and others?


Thanks! Hard to make a comparison to Cleric since their site doesn't really show any features or a demo. Onegrep is a fellow YC company, and we love what they're building! They seem to be more focused on workflows and pulling together context (also a very important problem in the space), while we've put more of a focus on root-causing infra issues specifically. Resolve seems to come from the same category as Temporal, which are more traditional automation platforms. These end up being somewhat rigid tools in that you have to very explicitly define each step, and they require a certain level of CI/CD or monitoring sophistication to be useful. Using LLMs allows us to relax these requirements and follow workflows like an actual engineer would.

I haven't heard of Beeps and can't find it, could you share the URL?


> They seem to be more focused on workflows and pulling together context (also a very important problem in the space), while we've put more of a focus on root-causing infra issues specifically.

So just to clarify, are you saying that Parity is focused on infrastructure issues, while something like Onegrep addresses the broader problem by providing context?

> I haven't heard of Beeps and can't find it, could you share the URL?

https://www.beeps.co/


Yes, my understanding is that Onegrep is meant to provide context from your documentation and past incidents, which can be helpful when trying to solve an alert. We're focused on root-causing underlying infrastructure issues by actually looking into the logs/configurations/metrics.

Ah I actually did see beeps a while back. I haven't tried their product, but they seem to be similar to rootly/Onegrep in that they're working on on-call management/post-mortems


Website won't load - just me?


which website doesn't load for you?


tryparity.com, looks like it's a local problem, loads on cellular.


For god sakes, SREs need to give up on K8. It was a bad idea, just move on.

The answer is not, “let an ai figure it out.”

That is legitimately scary.


What’s more “legitimately scary” is that people wanna run it back to monoliths like it’s the 90s and think building a modern scalable system is too hard


What would you propose as an alternative to k8s?


go back to building monoliths


As a software engineer, DevOps engineer, platform engineer and SRE in a mixed bag, I would say don't build monoliths -- instead build microservices, but slightly larger ones that are still easily cloneable, scalable and fault tolerant. A mix of monolith and microservice, you may say, and I would like to call that a "siloservice".

Silo: A silo is a cylindrical tower used for bulk storage, like grain silos that stand tall near farms. Another kind of silo is harder to see — military silos are underground.

Obviously, you don't need 10 fragmented microservices interdepending on each other; that's one of the biggest cases of overengineering with microservices in real-world practice. But you can build multiple "siloservices" that do the same stuff more effectively while staying easy to maintain. I got this inspiration from working with monorepos in the past.


I don't get how that solves the scalability and deployment/sdlc problems k8s is used for

Like, monoliths and k8s seem orthogonal to me. It's like saying 'docker was a failure we should go back to writing Java' to me.


While I agree that there are certainly cases of microservices being used in places they shouldn’t be, I have trouble imagining that monoliths are strictly better in every case. Do you have suggestions for running monoliths at scale?


I think the big problem is it tries to do too much. We used to have many tools as SRE but now teams are really limited. We handed the keys to the engineers which I think is overall a good intention. But we didn’t set them up with sensible defaults, which left them open to making really bad decisions. We made it easy to increase the diversity in the fleet and we removed observability. I think things are more opaque, more complicated, and I have fewer tools to deal with it.

I miss having lots of tools to reach for. Lots of different solutions, depending on where my company was and what they were trying to do.

I don’t think one T-shirt size fits all. But here are some specific things that annoy me.

Puppet had a richer change management language than docker. When I lost puppet, we had to revert back to shitty bash scripts, and nondeterminism from the cicd builds. The worst software in your org is always the build scripts. But now that is the whole host state! So SREs are held captive by nonsense in the cicd box. If you were using Jenkins 1.x, the job config wasn’t even checked in! With puppet I could use git to tell me what config changed, for tracked state anyway. Docker is nice in that the images are consistent, which is a huge pain point with bad puppet code. So it’s a mixed bag.

The clouds and network infrastructure have a lot of old assumptions about hosts/ips/ports. This comes up a lot in network security, and service discovery, and cache infrastructure. Dealing with this in the k8 world is so much harder, and the cost and performance so much worse. It’s really shocking to me how much people pay because they are using these software based networks.

The Hypervisors and native cloud solutions were much better at noisy neighbor protection, and a better abstraction for carving up workloads. When I worked at AWS I got to see the huge lengths the ebs and ec2 teams put into providing consistent performance. VMWare has also done a ton of work on QoS. The os kernels are just a lot less mature on this. Running in the cloud inside a single vm removed most of the value of this work.

In the early 2010s, lots of teams were provisioning ec2 instances and their bills were easy to see in the bill as dollars and cents. At my last company, we were describing workloads as replicas/gbs/cpus/clusters on a huge shared cluster. Thousands of hosts, a dozen data centers.

This added layer of obfuscation hides true cost of a workload. I watched a presentation from a large well known software service company say that their k8 migration increased their cloud spend because teams were no longer accountable to spend. At my company, I saw the same thing. Engineers were given the keys on provisioning but were not in the loop for cost cutting. That fell to the SREs, who were blamed for exploding costs. The engineers are really just not prepared to handle this kind of work. They have no understanding about the implications in terms of cost and performance. We didn’t train them on these things. But we took the keys away from the SRE’s and handed it to the engineers.

The debugging story is particularly weak. Once we shipped on docker and K8 we lost ssh access to production. 10 years into the docker experiment, we now have a generation of senior engineers who don’t know how to debug. I’ve spent dozens of hours on conference calls while the engineers fumbled around. Most of these issues could have been diagnosed with netstat/lsof/perl -pe/ping/traceroute. If the issue didn’t appear in New Relic, then they were totally helpless. The loss of the bash one-liner is really detrimental to engineers progress.

There is too much diversity in the docker base images and too many of them stuck. The tool encourages every engineer to pick a different one. To solve this my org promised to converge on alpine. But if you use a docker distribution, now you are shipping all of user mode to every process. I was on the hook for fixing a libc exploit for our fleet. I had everyone on a common base image, so fixing all 80 of my host classes took me about a few days. But my coworkers in other orgs who had hundreds of different docker images were working on it a year later. Answering the question, which LibC am I on became very difficult.

Terraform has a better provisioning/migration story. Use that to design your network and perform migrations. Use the cloud-native networking constructs. Use them for security boundaries. Having workloads move seamlessly between these “anything can be on me” hosts makes security a real nightmare.

I left being an SRE behind when I saw management get convinced docker/k8 was a cancer treatment, a dessert topping and a floor wax. It’s been five years and I think I made the right call.


> This agent is a combination of separate LLM agents each responsible for a single step of the runbook

Someone needs to explain to me how this is expected to work.

Percentage of Hallucinations/Errors x Steps in Runbook = Total Errors

0.05 x 10 = 0.5 = 50%
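(For what it's worth, the compounding isn't quite multiplicative: if each step fails independently with probability p, the chance of at least one bad step across n steps is 1 - (1 - p)^n, which for p = 0.05 and n = 10 is about 40% rather than 50%. Still plenty to be worried about.)

  p, n = 0.05, 10
  print(1 - (1 - p) ** n)   # ≈ 0.40: chance of at least one erroneous step, assuming independence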


Getting tired of seeing this concept of practically guaranteed hallucinations from any LLM used in production. I've used LLMs for various tasks, and if you tune your system correctly, it can be very reliable. It's just not always plug-and-play reliability. You need to set up your fine-tuning and prompts and then test well for consistent results.


> You need to set up your fine-tuning and prompts and then test well for consistent results.

Tell that to Google...

Seriously, it is well established that these systems hallucinate. Trying to say otherwise shows you are trying to push something that just is not true.

They can be right, yes. But when they are wrong they can be catastrophically wrong. You could be wasting time looking into the wrong problem with something like this.


If you're curious what the state of the art in multi-agent is looking like, I really recommend https://thinkwee.top/multiagent_ebook/


This looks great! Unfortunately it doesn't work well on Firefox, but I take it as being Mozilla's fault nowadays.


The most common hallucinations I've seen are phantom GitHub repos and issues, and this usually appears when I ask for a source.


It's definitely an overblown problem. In practice it's not a big issue.


Yeah.... no it's really not overblown.

It is a serious problem when these tools are being pushed as trustworthy when they are anything but.

On an almost daily basis I deal with some sort of hallucination in code or in summarizing something, and we see it constantly on social media when people try to use Google's AI summary as a source of truth.

Let's not lie to push an agenda about what these models can do. They are very powerful, but they make mistakes. There is zero question about that, and it happens quite often.

The problem isn't that they hallucinate, the problem is that we have comments like yours trying to downplay it. Then we have people for whom it is right just enough times that they start trusting it without double checking.

That is the problem: it is right enough times that you just start accepting the answers. That leads to making scripts that grab data and put it into a database without checking. It's fine if it is not business-critical data, but it's not really fine when we are talking about health care data or.. oh idk, police records like a recent post was talking about.

If you are going to use it for your silly little project, or you're going to bring down your own company's infrastructure, go for it. But let's not pretend the problem doesn't exist and shove this technology into far more sensitive areas.


Yeah.... no it's overblown.

I think you're exaggerating. You're imagining the worst but your argument basically boils down to not trusting that people can handle it, and calling me a liar. Good one.


> Tell that to Google...

Yeah, because Google's LLMs have a completely open question/answer space.

For e.g. a Kubernetes AI, you can nowadays just feed in the whole Kubernetes docs + a few reference Helm charts, tell it to stick close to the material, and you'll have next to no hallucinations. Same thing for simple data extraction tasks, where in the past you couldn't use LLMs because they would just hallucinate data into the output that wasn't there in the input (e.g. completely mangling an ID), which nowadays is essentially a non-issue.

As soon as you have a restrictable space in which the LLM acts, you have a lot of options to tune it so that hallucinations are not a major issue nowadays.


Every LLM conversation is guaranteed to contain some level of hallucination. You will never get the percentage down to zero.


> it can be very reliable

You need to quantify this. With actual numbers.

I am getting very tired of seeing everyone pushing LLMs and being disingenuous about exactly how often it is getting things wrong. And what the impact of that is. There is a reason that AI is not taking off in the enterprise and that is because people who take their job seriously are getting tired too.


I understand your sentiment, but I also don't think it's fair to say people are being disingenuous. I don't work for an AI company, I just use it with decent results.

My last project needed a way to tag certain types of business activity indicated in 17,000 user reviews. I wrote a prompt with 5 different tags along with descriptions for each business activity, took a few-shot approach by defining 8 different examples and how I would tag them, and then ran the tagging prompt on batches of indexed reviews, giving it 100 reviews per batch. I did a random sampling of about 200 items, and the success rate was roughly 89%. I could have improved it by doing more iterations, and possibly fine-tuning if I felt that it was worth it.

In every run, it generated matching results for the input JSON in a corresponding output JSON with no errors in any run.

That's the only example I have numbers on off the top of my head.

EDIT: This was using `gpt-4o-2024-05-13`
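For anyone curious what that looks like in practice, here's a stripped-down sketch of that sort of batched, few-shot tagging call (the tags, prompt, and batch handling are made up for illustration, not the parent's actual setup):

  import json
  from openai import OpenAI

  client = OpenAI()

  SYSTEM = """You tag the business activity described in user reviews.
  Allowed tags: PRICING, SUPPORT, SHIPPING, QUALITY, OTHER.
  Reply with JSON mapping review id to tag, e.g. {"r1": "SHIPPING"}."""

  def tag_batch(reviews):   # reviews: dict of id -> review text, ~100 per call
      resp = client.chat.completions.create(
          model="gpt-4o-2024-05-13",
          response_format={"type": "json_object"},
          messages=[
              {"role": "system", "content": SYSTEM},
              {"role": "user", "content": json.dumps(reviews)},
          ],
      )
      return json.loads(resp.choices[0].message.content)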


(Reader beware: basically everything the LLM emits in here is wrong.)

I was attempting to find formats that used a particular quoting style — SQL-style double-the-character-to-escape — and bizarrely JSON came up.

Gemini:

> JSON: In JSON, double quotes are used to delimit strings. If a double quote needs to be included within the string itself, it's escaped by doubling it (e.g., "This is a string with a "quote" in it.").

That is,

  "This is a string with a "quote" in it."
Me: Does your JSON example parse as valid JSON?

Gemini:

> Yes, the JSON example I provided ("This is a string with a "quote" in it.") is actually valid JSON.

> The reason it's valid is because the doubled quote within the string is interpreted as an escaped character, representing a single literal quote within the string itself. The JSON parser recognizes the double quotes at the beginning and end as delimiters for the entire string, not the escaped quote within.

It also called out,

> XML: Use a backslash (\) to escape characters, including quotes.

In a different context, and more towards "working with k8s" that the OP's AI is focused on,

Me: Regex to extract the replica number from a statefulset pod's name?

Gemini:

> Use kubectl: The kubectl get pods command with output formatting can directly show replica numbers. Here's an example:

  kubectl get pods -o jsonpath='{.items[*].metadata.name}{" - replica: "}{.metadata.labels.replicaset-序号}'
The Chinese here, as best as I can tell, reads "serial number" … which sort of contextually makes sense in the sense of an LLM, I guess. The command is utter nonsense, of course.


My absolute favorite is when the model will tell you why something should be one way, but in the same message confidently tell you it is the other way.

I had a situation where I was playing around with having it write a quick method for me. I gave it the specs it needed to work with and the other methods it would call.

Basically, it wrote calling one of the methods with the wrong (missing) arguments. When I asked it to fix that call, it first said its current call was right but then when it went to explain it (in the same message), it pointed out that a certain missing argument was required to be passed in.

I have seen similar situations with these models looking at data time and time again. It theoretically sees it, but once that wrong data (the first failed interpretation) is part of its context it's basically all over.


Gemini is apparently hot garbage. I passed this input into ChatGPT (GPT-4o) and I got sensible answers.


Actual number: 0. I haven't seen a single hallucination in my workloads since GPT-4 but the task is very specific, as it should be.


We separated out the runbook such that each step is a separate LLM in the agent. Between each step, there's sort of a "supervisor" that ensures that the step was completed correctly, and then routes to another step based on the results. So in reality, a single step failing requires two hallucinations. Hallucinations are also not a fixed percentage across all calls -- you can make them less likely by maintaining focused goals (this is why we made runbooks agentic rather than a single long conversation)


That second step hallucinating is far more likely when you are feeding it incorrect information from the first hallucination.

LLMs are very easy to manipulate.

At one point with a system prompt telling Claude it was OpenAI, I was able to ask what its model is and it would confidently tell me it was OpenAI. Garbage data in, garbage data out.

Admittedly that is an extreme case, but you're giving that second prompt wrong data in the hopes that it will identify it instead of just thinking it's fine when it is part of its new context.


yea. We're definitely concerned about hallucinations and are using a variety of techniques to try to mitigate them (there's some existing discussion here, but using committees and sub-agents responsible for smaller tasks has helped).

What's helped the most, though, is using cluster information to back up decision making. That way we know the data it's considering isn't garbage, and the outputs are backed up by actual data.


And what is your average error rate per runbook step?


one thing we're experimenting with to help with the hallucination/error rate issue is using a committee framework where we take a majority vote.

If the error rate of 1 expert is 5%, then for a committee of 10 experts, the probability that a majority of the committee errs is around 0.00028% (binomial distribution with p=0.05). For 10 steps, this compounds to an error rate of roughly 0.0028%
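For the curious, a quick way to reproduce those numbers (treating committee members as independent, which, as pointed out below, is a simplification):

  from math import comb

  def majority_error(p=0.05, members=10):
      # Probability that a strict majority (6 or more of 10) of independent members err.
      return sum(comb(members, k) * p**k * (1 - p) ** (members - k)
                 for k in range(members // 2 + 1, members + 1))

  per_committee = majority_error()              # ≈ 2.8e-6, i.e. ~0.00028%
  per_runbook = 1 - (1 - per_committee) ** 10   # ≈ 0.0028% across 10 steps
  print(per_committee, per_runbook)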


Pretty bad maths there. Those committee members are not independent.

They are highly correlated even amongst LLMs from different vendors.


I'm not sure they are highly correlated. A committee uses the same LLM with the same input context to generate different outputs. Given the same context LLMs should produce the same next token output distribution (assuming fixed model parameters, temperature, etc). So, while tokens in a specific output are highly correlated, complete outputs should be independent since they are generated independently from the same distribution. You are right they are not iid but the calculation was just a simplification.



