They should define this, but after having read the entire article I think it’s clear they mean “frameworks for evaluating the output of an agent” rather than what first might come to mind as “LLM evals”.
Their thesis is that even when the eval is useless for correctness of a single agentic action in production, it allows you to choose between two agents by cross-comparing in a large aggregated collection of tasks. Effectively: you can tune your agentic parameters.
Nothing new to the idea that taking many samples and averaging can work when a single datapoint doesn’t. Presumably this is part of a conversation in which we’re lacking context.
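To make the averaging point concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration, not something from the article: the judge is modeled as a coin that reports the true verdict 65% of the time, and the two hypothetical agents have true success rates of 0.60 and 0.75.

```python
import random


def judge(truly_correct: bool, accuracy: float, rng: random.Random) -> bool:
    """Hypothetical noisy judge: reports the true verdict with probability `accuracy`."""
    return truly_correct if rng.random() < accuracy else not truly_correct


def mean_judged_score(success_rate: float, n_tasks: int,
                      accuracy: float, rng: random.Random) -> float:
    """Fraction of tasks the noisy judge marks as passed for one agent."""
    passes = 0
    for _ in range(n_tasks):
        # Ground truth for this task, hidden from the judge's caller.
        truly_correct = rng.random() < success_rate
        passes += int(judge(truly_correct, accuracy, rng))
    return passes / n_tasks


rng = random.Random(0)  # fixed seed so the run is repeatable

# Assumed values: agent A succeeds 60% of the time, agent B 75%,
# and the judge is right on only 65% of individual verdicts.
score_a = mean_judged_score(0.60, 20_000, 0.65, rng)
score_b = mean_judged_score(0.75, 20_000, 0.65, rng)

print(f"agent A judged pass rate: {score_a:.3f}")
print(f"agent B judged pass rate: {score_b:.3f}")
```

A single verdict from this judge is wrong about a third of the time, so you would not gate a production action on it. The judged pass rate, though, is expected to be p * 0.65 + (1 - p) * 0.35 for an agent with true success rate p, which grows with p, so over enough tasks the better agent should come out ahead. The catch is that this assumes the judge's errors are independent of which agent produced the output; a judge biased toward one agent's style would break the comparison.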
"LLM evals" is maybe an overused term because it can mean a bunch of things. This article talks about LLM-as-a-judge where an LLM scores another system's outputs.
The poorest of the poor, subsistence farmers are barely producing enough to feed themselves; they trade and barter the little bit they can manage but it is not much and has little impact that goes beyond a tiny village-level radius. Nobody is displacing that because nobody needs to compete with that.
Very commonly governments will displace these poor to build factories or expand large scale farming practices with international development deals. The land gets taken from the poor and they are left with little choice but to work in these new factories often in abusive conditions.
> Mozilla uses the term "vulnerability" for even sec-high, even though they say right below that it doesn't mean the same thing as a practical exploit.
That’s not evident in what you pasted at all.
What you pasted says
> sec-critical and sec-high are assigned to vulnerabilities that can be triggered with normal user behavior […] We make no technical difference between these […] sec-critical bugs are reserved for issues that are publicly disclosed or known to be exploited in the wild.
> sec-low is assigned to bugs that are annoying but far from causing user harm (e.g, a safe crash).
From this one infers that the 180 sec-high bugs found are actually exploitable, just not known to have been exploited in the wild, and are NOT merely annoying bugs.
The difference between 180 and 270 does nothing to deflate the significance, or lack thereof, of the implication re: Mythos.
> This also seems like a win for society, if there is some sort of pattern with ai helping with crimes.
That fails to recognize the tradeoff between freedom and security. Society suffers if we, for instance, lock everyone up, despite the reduction that would have in crimes. The balance between the two cannot be ignored to justify outcomes, though it is American tradition to value liberty over security when the two come in conflict.
On many occasions, I have been told to “be more empathetic.”
When I ask why, I typically get this reaction:
This is a ridiculous question. I am not going to answer it because it is so ridiculous.
Empathy is the right thing to do! You should feel bad for that person. We’re humans, after all.
These explanations never really helped.
*
Even after reading this, I am not sure the author really gets what is behind the request.
From FBI negotiator Chris Voss' book, Never Split the Difference:
> There is nothing more frustrating or disruptive to any negotiation than to get the feeling you are talking to someone who isn't listening. Playing dumb is a valid negotiating technique, and “I don't understand” is a legitimate response. But ignoring the other party's position only builds up frustration and makes them less likely to do what you want.
> The opposite of that is tactical empathy.
> In my negotiating course, I tell my students that empathy is “the ability to recognize the perspective of a counterpart, and the vocalization of that recognition.” That's an academic way of saying that empathy is paying attention to another human being, asking what they are feeling, and making a commitment to understanding their world.
> Notice I didn't say anything about agreeing with the other person's values and beliefs or giving out hugs. That's sympathy. What I'm talking about is trying to understand a situation from another person's perspective.
---
The respondent to the author is ironically showing why empathy is so important. By being non-empathetic and shutting down the question as "stupid", the author is bound to feel the respondent doesn't care to understand their position. If the respondent really cared about having the author understand their position, they would have first shown that they will try to understand the author's, even if they don't agree with it.
This is also the driver behind so much of the toxicity of modern politics. All the snark, condescension, and contempt just sets up a feedback loop that drives people even further away from each other.
Those seem like particularly bad reasons. I'm not sure if they are the arguments that the author has been given or if that's what he perceived those arguments to be.
My take on it is to remember that the people you are talking to are real people with reasons for doing things. Very few people do things that they think are wrong at the time of doing them.
I would guess that the single most common cause of bad faith arguments comes from people jumping to the conclusion that the person they are dealing with is acting in bad faith.
Reflecting on it some more perhaps you can boil it down to the implications of dealing with real people.
If you don't act with empathy you can hurt people. Is it your intention to hurt people?
If it turns out your motivation is, in fact, to hurt people then the issue isn't empathy but your own motivations. Reflecting on your motivations and what you feel like you should be doing as a person is the path to take here.
>Those seem like particularly bad reasons. I'm not sure if they are the arguments that the author has been given or if that's what he perceived those arguments to be.
I think this might have to just be axiomatic. At the bottom of every system is an axiom, whether it's identity in mathematics or "we hold these truths to be self-evident, that all men are created equal" in USican politics or "empathy is good and to be pursued" in interpersonal relationships.
>If you don't act with empathy you can hurt people. Is it your intention to hurt people?
I dare say that if your mission is actually to hurt people as much as you can empathy will help you a lot in that goal because it lets you define strategies tailored to hurt the target based on their feelings. Without empathy you're limited to thinking about what would hurt you and then applying it to other people.
I believe the point they are making is that they believe the original author still does not "get it". I'm inclined to agree.
When people say one lacks empathy, rejecting what they most likely mean by that and reinterpreting empathy to mean what one wants it to mean is fundamentally a demonstration of a lack of empathy in what was likely the original context, even if the new interpretation technically aligns with the dictionary definition.
I want to be clear that the recommendations the post makes are helpful, and seeing the world as best one can from someone else's point of view is worthwhile.
At a fundamental level, though, saying "I see your definition of empathy and reject it for my own, which I'll be happy to try to live by", while noble, is likely directly contrary to both parties' use of the term.
Impossible to say what was behind any specific request, but what is generally meant by “Have a little empathy” and its kin is: “Stop criticizing/judging/etc. or communicating with the individual being discussed that sharply, because we feel the individual has good reasons/a good excuse/a good justification for sympathy and/or some leniency here.”
I think the author understands that "have a little empathy" is a request to modify their behavior, but expresses their frustration that the request is unclear and (in the author's experience) the requestors won't clarify the request.
Contrast this with the conversation we're having now where I requested clarification on your initial comment, and you thoughtfully provided it.
But to your point, there's an innate "getting it" where your level/expression of empathy is roughly in-line with people you interact with, and if you don't have that, you need to do work to "get it", which is what the author did.
We can compare empathy to other practices that can benefit from innate understanding. Some people "get" poetry, math, music, long-distance running, etc. and we can all work to "get" them, but in my experience, it's never quite the same.
I _don’t_ think that empathy has anything to do with it though.
Behaviour modification yes, but that is “stop talking so critically”. Or “don’t be so harsh” or “give this person special treatment”. WHEN to do that might be key here: perhaps the colleague’s husband has cancer, or their child missed school 3 days this week with the flu, or their project wasn’t productionalized, or they’re new to the role, etc. So a blanket “don’t talk so harshly” isn’t called for; what is really desired is social calibration.
But instead it seems everyone is getting caught up on the literal interpretation of this figure of speech instead.
_May_ be a case for extending what has been explored by theory to cover more useful ground (or not, depending on whether real-world use cases like yours are too heterogeneous for effective general techniques).
100%. Not sure why you’re downvoted here, there’s nothing controversial here even if you disagree with the framing.
I would go on to say that this interaction between ‘holes’ exposed by LLM expectations _and_ demonstrated userbase interest _and_ expert input (by the devs’ decision to implement changes) is an ideal outcome that would not have occurred if each of the pieces were not in place to facilitate these interactions, and there’s probably something here to learn from and expand on in the age of LLMs altering user experiences.