That's only one of many definitions of the word "agent" outside the context of AI. Another is something that produces effects on the world. Another is something that has agency.
Sort of interesting that we've coalesced on this term that has many definitions, sometimes conflicting, but where many of the definitions vaguely fit into what an "AI Agent" could be for a given person.
But in the context of AI, Agent as Anthropic defines it is an appropriate word because it is a thing that has agency.
It would only be circular if agency were defined solely as "the property of being an agent." Nobody is proposing that circle of reasoning as the formal definition.
Perhaps you mean tautological. In that case, an agent having agency would be an informal tautology: a relationship so basic to the subject matter that it essentially must be true. That would be the strongest possible type of argument.
Or the test wasn't testing anything meaningful, which IMO is what happened here. I think ARC was basically looking at the distribution of what AI is capable of, picked an area that it was bad at and no one had cared enough to go solve, and put together a benchmark. And then we got good at it because someone cared and we had a measurement. Which is essentially the goal of ARC.
But I don't much agree that it is any meaningful step towards AGI. Maybe it's a nice proof point that AI can solve simple problems presented in intentionally opaque ways.
I'd agree with you if there hadn't been very deliberate work towards solving ARC for years, and if the conceit of the benchmark weren't specifically based on a conception of human intuition as, put simply, learning and applying out-of-distribution rules on the fly. ARC wasn't some arbitrary inverse set; it was designed to benchmark a fundamental capability of general intelligence.
For some subset I am certain AI is just the latest excuse to justify bad existing business practices. Much safer to say the world has changed due to AI and that's why you're changing your hiring plan instead of "our plan was bad and we dramatically overhired and misjudged the post-covid world"
Sure. But then they added "maybe they should be". Yeah, it's a clickbait title not backed up by the content, but I think it's fair to criticize the words the author chose to write.
I appreciate the blog post and I learned a bunch from it!
But the quote that comes to mind is: "People who say it cannot be done should not interrupt those who are doing it"
It's quite a stretch to say that JSON is broken and can't be fixed. JSON works exceptionally well in practice and any review of it that fails to acknowledge this is, IMO, very lacking in perspective.
If you look at the statistics, you’ll find that casual, condomless sex is very common. Despite the risks of STI transmission, it often pans out okay: most STIs are readily curable, HIV has a low transmission rate amongst heterosexual couples (about 0.08% to 0.2% per sexual act) and PrEP is pretty common for homosexual men, and HSV has fairly low odds of transmission outside of an active outbreak (annual odds of transmission amongst couples: about 11–17% with a male source partner and 3–4% with a female source partner). Consequently, many people who have casual, condomless sex with many, many partners will end up just fine.
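As a rough sanity check on those figures, per-act risks compound like any independent-trial probability: the chance of at least one transmission over n exposures is 1 − (1 − p)^n. A quick sketch (the 0.1% per-act rate is illustrative, picked from the middle of the range quoted above; real risk varies per act and isn't truly independent):

```python
# Probability of at least one transmission over n independent exposures,
# each with per-act transmission probability p: 1 - (1 - p)^n.
def cumulative_risk(p: float, n: int) -> float:
    return 1 - (1 - p) ** n

# Illustrative: a 0.1% per-act rate (mid-range of the 0.08%-0.2% quoted
# above) compounded over 100 acts still comes out under 10%.
risk = cumulative_risk(0.001, 100)
print(f"{risk:.3f}")  # roughly 0.095
```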
Is it reckless? Well, the connotation of “reckless” is fairly negative, and I don’t see why I should get to judge what two consenting adults decide is an acceptable risk vs reward for themselves (at least, so long as I’m not involved).
However, when we pivot back to the domain of software/engineering: when software design choices are made via bandwagon fallacy, claiming that “it usually works out okay”, I do find that to be reckless. You may be fine with such a carefree approach, but it isn’t really fair to other engineers, users, and stakeholders.
It’s not that I believe JSON (or similarly pitfall-laden technology) should be strictly avoided, but rather that the risks and failure modes should be given serious consideration, rather than minimized or outright dismissed. In terms of the STI analogy, it’s perfectly reasonable for two individuals to be aware of the risks, and depending on their appetite for said risk, agree upon the inclusive/exclusive nature of their relationship, as well as whether they exchange test results; what would be ridiculous is to pretend that the risk of infection is zero.
There would be zero value in the article minimizing the design flaws (and resulting footguns) in JSON. The insistence that the article should do so is about as bizarre as someone responding to an article on safer sex by insisting that the article be amended with a note that “… but raw-dogging strangers isn’t too risky anyway, so, like, YOLO”.
I'm not sure if I should be impressed by your detailed knowledge of STI statistics or unnerved by your attempt to analogize unprotected sex to use of JSON. I'll split the difference and say, "what?"
Hey, if you like living your life uninformed of the risks of your actions (or maybe it’s that the risks are irrelevant to you, if you’re not getting any?), you do you.
I don’t know how I can help you understand the utility of analogies (you recognized the rhetoric as such, yet simultaneously seem stuck on the fact that analogies don’t establish a literal connection). However, I’m not sure if that’s actually the problem, or if you’re feigning confusion in an attempt to offend me. Granted, it would seem an odd choice for you to claim incompetence as part of an attempted insult.
Regardless, I’d be happy to help you in any way I can.
They aren't claiming incompetence and you're being rude.
Ironically, there is some projection here about playing dumb, since you're pretending you don't understand why getting and spreading sexually transmitted diseases is a poor analogy to "using JSON".
A good analogy needs more than a slight parallel: it should have many significant parallels (and few significant misses) that help you think about the shape of the topic being discussed.
You can hate JSON, but pretending it is similar to infectious disease in cause, usage, value, solution, or expression makes it really ineffective at bringing people to your side, since it's really hard to see any useful similarities. That makes it seem like the whole point is just fluffed-up "JSON bad" and "people who aren't me are dumb".
If you're actually happy to help in any way you can, then stop being patronizing and either walk back your overly incendiary first shots or make a good-faith attempt to clarify after someone pushes back on an intentionally extreme example.
> They aren't claiming incompetence and you're being rude.
No, they weren't.
> Ironically there is some projection here about playing dumb since you're pretending you don't understand why getting and spreading sexually transmitted disease is a poor analogy to "using json".
I don't understand why it would be a poor analogy, and I'm not pretending. (And I hope I'm not too dumb for real.) Please explain.
> You can hate json but pretending it is similar to infectious disease in cause, usage, value, solution, or expression makes it really ineffective at bringing people to your side since it's really hard to see any useful similarities
The obvious similarity I saw was "this has risks. Not huge risks, the sky isn't falling, but definitely not zero either, so it's useful to be informed about and explicitly acknowledge and weigh the risks." Am I dumb for seeing this similarity? Or for not finding it rude?
For someone who wrote such an overwrought analogy, you should know that an analogy can't bake in a conclusion on the very thing that's up for debate.
The problem with your analogy is that you chose something with serious, undeniable consequences: STIs. The analogy doesn't work because the person above doesn't grant that the consequences of using JSON are comparably severe.
So if you like health analogies, replace STIs with the common cold and explain how scary it is even though we've had the common cold hundreds of times and are well aware of its severity.
It's a test in the sense that it's meant to validate functionality. You're correct there.
The endpoints we poke at are provided as context when creating the application.
Our approach evolved to be more liberal about what was required to pass. Instead of looking for an HTML element with id="foo", we accept a 200 HTTP response code. It's subtle, but it made a huge improvement in the end-user experience.
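A minimal sketch of the two pass criteria (the stub server and names here are hypothetical stand-ins for the generated application, just to contrast the checks):

```python
import http.server
import threading
import urllib.request

# Hypothetical stub standing in for the generated application under test.
class StubHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(b"<html><body><div>hello</div></body></html>")

    def log_message(self, *args):  # keep test output quiet
        pass

server = http.server.HTTPServer(("127.0.0.1", 0), StubHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_address[1]}/"

resp = urllib.request.urlopen(url)
body = resp.read().decode()

# Strict check: brittle, fails whenever markup details change.
strict_pass = 'id="foo"' in body

# Liberal check: did the endpoint respond successfully at all?
liberal_pass = resp.status == 200

server.shutdown()
print(strict_pass, liberal_pass)  # False True
```

The liberal check passes any working page, while the strict one fails on a page that works fine but happens to render its markup differently, which is exactly the false-negative the comment above is avoiding.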
You should care about counterparty risks. If your business model depends on unsustainable third-party prices powered by VC largesse and unrealizable dreams of dominance, the very least you can do is plan for the impending reckoning, after which GPU prices will be determined by costs.
Look, I understand that some people are short-sighted and can hardly think outside the box, and that is totally fine by me. I don't judge you for it, so I kindly ask you not to judge my question. Learn to give some benefit of the doubt.
LLMs have dramatically different results depending on the domain. Getting LLMs to help me learn TypeScript is a joy; getting them to help me fix distributed consensus problems in my fully bespoke codebase makes them look worse than useless.
Some people will find them amazing, some will find them a net negative.
Although, finding truly zero use for them makes it hard for me to believe that this person really tried with creativity and an open mind.
Very much this. I have 25+ years of programming experience, but not with TypeScript and React, and it's helping me with my current project. I ignore probably 2/3 of its auto-suggestions, but increasingly I now highlight some code and ask it to just do x for me, rather than having to go google the right function/CSS magic.
I'm fine with it. I've forgotten so many frameworks and libs for now-dead devices/services/OSes over the years that it's largely pointless memorising these things. I'm very happy for a machine to help me get to where I want to be, and the less time faffing about with Google/Stack Overflow, the better. Like I said, the failure rate is still fairly high, but it's still useful enough.
> getting them to help me fix distributed consensus problems in my fully bespoke codebase makes them look worse than useless.
Often the complex context of such a problem is clearer in your head than you can write down. No wonder the LLM cannot solve it; it doesn't have the right info on the problem. But if you then suggest to it, "what if it has to do with this or that race condition, since service A does not know the end time of service Z?", it can often come up with different search strategies to find that out.
It's an open secret that we have no idea how to actually measure tech company productivity. That's why there isn't and will not be clear evidence for or against RTO.
Best you can do is pick a narrow enough sliver that it is measurable. Then claim it is the "important" view and wow, what a shock, the data supports your position!
I agree on an individual level, but at a company level it’s fairly easy to measure things like product feature shipping velocity, changes in business metrics like growth, etc. When you’re talking about company-wide changes like RTO, it would theoretically show up in these core metrics.