great libraries already exist for usage tracking (like ccusage); think of this as just a social layer on top so you can follow other people's Claude Code sessions.
agree, landing on a global feed vs. "following" makes it clearer; will make that change now.
re: gaming the system, the cli tool pulls from local usage data files to "verify". while this might be possible to spoof, it'd be a lot of work for not very much gain :)
The inherent problem with evaluating models' coding performance remains: most day-to-day coding tasks are open-ended/partially spec'd, and as such there is huge uncertainty about what the "right" solution looks like.
It's very hard to rank models' solutions on such problems, which is why they rarely appear in benchmarks (I'd be glad to stand corrected).
Even Opus 4.5 coding a C compiler from scratch - jaw-dropping as it is - doesn't tell the whole story. Most of my tasks are not that well spec'd.
Yes, it seems the open benchmark results that are normally reported, such as SWE-bench, SWE-bench Verified, and Terminal-bench, aren't really that indicative of success in more general use cases.
According to Gemini, SWE-bench is actually a very narrow test, consisting of fixing GitHub issues drawn from 12 large Python projects (with Verified being a curated subset of that), while Terminal-bench (basically agentic computer tool use) focuses more on the general case than on the tools a typical coding agent such as Claude Code, Codex CLI or Gemini CLI actually uses.
Salesforce, like every large enterprise software company, has a formal (and strict) End of Life process. It starts with an announcement like this indicating End of Sales; then, once the contract obligations are met, they can end support, and finally reach EoL.
There is no way they can avoid this kind of public notice.
My second reaction: still incredible, but noting that a C compiler is one of the most rigorously specified pieces of software out there. The spec is precise, the expected behavior is well-defined, and test cases are unambiguous.
I'm curious how well this translates to the kind of work most of us do day-to-day where requirements are fuzzy, many edge cases are discovered on the go, and what we want to build is a moving target.
This is the key: the more you constrain the LLM, the better it will perform. At least that's my experience with Claude. When working with existing code, the better the code to begin with, the better Claude performs, while if the code has issues then Claude can end up spinning its wheels.
Yes I think any codegen with a lot of tests and verification is more about “fitting” to the tests. Like fitting an ML model. It’s model training, not coding.
But in a lot of programming, we discover correctness as we go, which is one reason humans don't completely exit the loop. We need to see and build tests as we go, giving them particular care and attention to ensure they test what matters.
Important: I didn't see Opus 4.6 in Claude Code. I have the native install (which is the recommended installation). So I re-ran the installation command and, voila, I have it now (v2.1.32).
We'll see. The first two things that they said would move from "emerging tech" to "currently exists" by April 2026 are:
- "Someone you know has an AI boyfriend"
- "Generalist agent AIs that can function as a personal secretary"
I'd be curious how many people know someone that is sincerely in a relationship with an AI.
And I'd also love to know anyone who has honestly replaced their human assistant/secretary with an AI agent. I have an assistant, and their value goes well beyond rote input-output tasks... Also, I encourage my assistant to use LLMs when they can be useful, like for supplementing research tasks.
Fundamentally though, I just don't think any AI agents I've seen can legitimately function as a personal secretary.
Also they said by April 2026:
> 22,000 Reliable Agent copies thinking at 13x human speed
And when moving from "Dec 2025" to "Apr 2026" they switch "Unreliable Agent" to "Reliable Agent". So again, we'll see. I'm very doubtful given the whole OpenClaw mess. Nothing about that says "two months away from reliable".
There are plenty of companies that sell an AI assistant that answers the phone as a service, they just aren't named OpenAI or Anthropic. They'll let callers book an appointment onto your calendar, even!
No, there are companies that sell voice activated phone trees, but no one is getting results out of unstructured, arbitrary phone call answering with actions taken by an LLM.
I'm sure there are research demos in big companies, I'm sure some AI bro has done this with the Twilio API, but no one is seriously doing this.
All it takes is one "can you take this to the post office", the simplest of requests, and you're at a dead end of, at best, refusal, but more likely role-play.
Agreed that “unstructured arbitrary phone calls + arbitrary actions” is where things go to die.
What does work in production (at least for SMB/customer-support style calls) is making the problem less magical:
1) narrow domain + explicit capabilities (book/reschedule/cancel, take a message, basic FAQs)
2) strict tool whitelist + typed schemas + confirmations for side effects
3) robust out-of-scope detection + graceful handoff (“I can’t do that, but I can X/Y/Z”)
4) real logs + eval/test harnesses so regressions get caught
Once you do that, you can get genuinely useful outcomes without the role-play traps you’re describing.
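To make points 2-4 concrete, here's a minimal Python sketch of the pattern (all names are hypothetical, not our actual stack or any real API): the model can only propose calls from a typed whitelist, side effects require confirmation, and anything out of scope routes to a handoff instead of improvisation.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical typed schemas for the only actions the agent is allowed to take.
@dataclass
class BookAppointment:
    name: str
    date: str   # ISO date, validated upstream
    time: str   # "HH:MM"

@dataclass
class TakeMessage:
    caller: str
    message: str

# Strict tool whitelist: anything else is out of scope by definition.
TOOL_WHITELIST: dict[str, type] = {
    "book_appointment": BookAppointment,
    "take_message": TakeMessage,
}

def dispatch(tool_name: str, args: dict, confirm: Callable[[str], bool]) -> str:
    """Validate a model-proposed tool call, confirm it, then execute."""
    if tool_name not in TOOL_WHITELIST:
        # Out-of-scope detection: graceful handoff, not role-play.
        return "I can't do that, but I can book, reschedule, or take a message."
    call = TOOL_WHITELIST[tool_name](**args)  # raises on malformed arguments
    if not confirm(f"Confirm {tool_name}: {call}"):
        return "Okay, I won't do that."
    # ...execute the side effect here and log it for the eval harness...
    return f"Done: {tool_name}"
```

The point isn't the code itself, it's that the failure modes become enumerable: bad arguments fail validation, unknown intents hit the handoff path, and every side effect leaves a log you can regress against.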
We’ve been building this at eboo.ai (voice agents for businesses). If you’re curious, happy to share the guardrails/eval setup we’ve found most effective.
It's important to remember, though (this is beside the point for what you're saying), that job displacement of things like secretaries by AI does not require it to be a near-perfect replacement. There are many other factors: for example, if it's much cheaper and can do part of the work, it can dramatically shrink demand as people shift to an imperfect AI replacement.
I think they immediately corrected their median timelines for takeoff to 2028 upon releasing the article (I believe there was a math mistake or something initially), so all those dates can probably be bumped back a few months. Regardless, the trend seems fairly on track.
People have been in love with machines for a long time. It's just that the machines didn't talk back so we didn't grant them the "partner" status. Wait for car+LLM and you'll have a killer combo.
Only on HN will people still doubt what is happening right in front of their eyes. I understand that putting things into perspective is important; still, the type of downplaying we can see in the comments here is not only funny but also has a dangerous dimension to it. Ironically, these are the exact same people who will claim "we should have prepared better!" once the effects become more and more visible. Dear super engineers, while I feel sorry that your job and passion are becoming a commodity right in front of you, please stay out of the way.
Scott Alexander essentially provided editing and promotion for AI 2027 (and did a great job of it, I might add). Are you unaware of the actual researchers behind the forecasting/modelling work behind it, and you thought it was actually all done by a blogger? Or are you just being dismissive for fun?
There's still no evidence we'll have any takeoff. At least not in the "Foom!" sense of LLMs iteratively and independently improving themselves to substantially new levels, reliably sustained over many generations.
To be clear, I think LLMs are valuable and will continue to significantly improve. But self-sustaining runaway positive feedback loops delivering exponential improvements resulting in leaps of tangible, real-world utility is a substantially different hypothesis. All the impressive and rapid achievements in LLMs to date can still be true while major elements required for Foom-ish exponential take-off are still missing.
If only General Relativity had such an ironclad defense of being as unfalsifiable as the Foom hypothesis is. We could've avoided all of the quantum physics nonsense.
it doesn't mean it's unfalsifiable - it's a prediction about the future, so you can falsify it once there's a bound on when it's supposed to happen. it just means there's little to no warning. I think it's a significant risk of AI progress that improvement speed could outpace the speed at which we get warnings of, or notice threats from, that improvement.
This has already been going on for years. It's just that they were using GPT 4.5 to work on GPT 5. All this announcement means is that they're confident enough in early GPT 5.3 output to further refine GPT 5.3 based on initial 5.3. But yes, takeoff will still happen because this recursive self-improvement works; it's just that we're already past the inception point.
I think it's important in AI discussions to reason correctly from fundamentals and not disregard possibilities simply because they seem like fiction/absurd. If the reasoning is sound, it could well happen.
Intelligence might be more like an optimization problem, fitting inputs to optimal outputs. Sometimes reality is simply too chaotic to model precisely so there is a limit to how good that optimization can be.
It would be like distance to the top of a mountain. Even if someone is 10x closer, they could still only be within arm's reach.
making the specifications is still hard, and checking how well results match against specifications is still hard.
i don't think the model will figure that out on its own, because the human in the loop is the verification method for saying if it's doing better or not, and more importantly, for defining better
Impressive results, but I keep coming back to a question: are there modes of thinking that fundamentally require something other than what current LLM architectures do?
Take critical thinking — genuinely questioning your own assumptions, noticing when a framing is wrong, deciding that the obvious approach to a problem is a dead end. Or creativity — not recombination of known patterns, but the kind of leap where you redefine the problem space itself. These feel like they involve something beyond "predict the next token really well, with a reasoning trace."
I'm not saying LLMs will never get there. But I wonder if getting there requires architectural or methodological changes we haven't seen yet, not just scaling what we have.
When I first started coding with LLMs, I could show a bug to an LLM and it would start to bugfix it, and very quickly would fall down a path of "I've got it! This is it! No wait, the print command here isn't working because an electron beam was pointed at the computer".
Nowadays, I have often seen LLMs (Opus 4.5) give up on their original ideas and assumptions. Sometimes I tell them what I think the problem is, and they look at it, test it out, and decide I was wrong (and I was).
There are still times where they get stuck on an idea, but they are becoming increasingly rare.
Therefore, I think that modern LLMs clearly are already able to question their assumptions and notice when the framing is wrong. In fact, they've been invaluable to me in fixing complicated bugs in minutes instead of hours because of how readily they question assumptions and throw out hypotheses. They've helped _me_ question some of my assumptions.
They're inconsistent, but they have been doing this. Even to my surprise.
agree on that - the speed with them is fantastic, and the dynamics of questioning the current session's assumptions have gotten way better.
yet, given an existing codebase (even one that isn't huge), they often won't suggest "we need to restructure this part differently to solve this bug". Instead they tend to push forward.
> These feel like they involve something beyond "predict the next token really well, with a reasoning trace."
I don't think there's anything you can't do by "predicting the next token really well". It's an extremely powerful and extremely general mechanism. Saying there must be "something beyond that" is a bit like saying physical atoms can't be enough to implement thought and there must be something beyond the physical. It underestimates the nearly unlimited power of the paradigm.
Besides, what is the human brain if not a machine that generates "tokens" that the body propagates through nerves to produce physical actions? What else than a sequence of these tokens would a machine have to produce in response to its environment and memory?
The point is that "predicting the next token" is such a general mechanism as to be meaningless. We say that LLMs are "just" predicting the next token, as if this somehow explained all there was to them. It doesn't, not any more than "the brain is made out of atoms" explains the brain, or "it's a list of lists" explains a Lisp program. It's a platitude.
In the case of LLMs, "prediction" is overselling it somewhat. They are token sequence generators. Calling these sequences "predictions" vaguely corresponds to our own intent with respect to training these machines, because we use the value of the next token as a signal to either reinforce or get away from the current behavior. But there's nothing intrinsic in the inference math that says they are predictors, and we typically run inference with a high enough temperature that we don't actually generate the max likelihood tokens anyway.
The whole terminology around these things is hopelessly confused.
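To make the temperature point concrete, here's a toy sketch of a single decoding step (plain softmax sampling, not any particular model's decoder): at temperature > 0 the generated token is drawn from a distribution rather than being the single most likely one, so even mechanically the output isn't "the prediction".

```python
import numpy as np

# Toy illustration: "max likelihood" (argmax) decoding vs. temperature
# sampling over the same next-token scores.
logits = np.array([2.0, 1.5, 0.5, -1.0])   # scores for 4 candidate tokens

def sample(logits: np.ndarray, temperature: float) -> int:
    """Draw a token index from softmax(logits / temperature)."""
    probs = np.exp(logits / temperature)
    probs /= probs.sum()
    return int(np.random.choice(len(logits), p=probs))

greedy = int(np.argmax(logits))            # always picks token 0
sampled = sample(logits, temperature=0.8)  # usually token 0, but not always
```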
I mean... I don't think that statement is far off. Much of what we do is entirely about predicting the world around us, no? From physics (where the ball will land) to the emotional state of others based on our actions (theory of mind), we operate very heavily on a predictive model of the world around us.
Couple that with all the automatic processes in our minds (blanks filled in that we didn't observe, yet we're convinced we did observe them), hormone states that drastically affect our thoughts and actions...
And the result? I'm not a big believer in the uniqueness or level of autonomy so many think we have.
With that said, I am in no way saying LLMs are even close to us, or are even remotely close to the right implementation to be close to us. The level of complexity in our "stack" alone dwarfs LLMs. I'm not even sure LLMs are up to a worm's brain yet.
> Or creativity — not recombination of known patterns, but the kind of leap where you redefine the problem space itself.
Have you tried actually prompting this? It works.
They can give you lots of creative options about how to redefine a problem space, with potential pros and cons of different approaches, and then you can further prompt to investigate them more deeply, combine aspects, etc.
So many of the higher-level things people assume LLM's can't do, they can. But they don't do them "by default" because when someone asks for the solution to a particular problem, they're trained to by default just solve the problem the way it's presented. But you can just ask it to behave differently and it will.
If you want it to think critically and question all your assumptions, just ask it to. It will. What it can't do is read your mind about what type of response you're looking for. You have to prompt it. And if you want it to be super creative, you have to explicitly guide it in the creative direction you want.
You would be surprised about what the 4.5 models can already do in these ways of thinking. I think that one can unlock this power with the right set of prompts. It's impressive, truly.
It has already understood so much, we just need to reap the fruits.
I'm really looking forward to trying the new version.
New idea generation? Understanding of new/sparse/not-statistically-significant concepts in the context window? I think both come down to the same problem of not having runtime tuning. When we connect previously disparate concepts, like in a "eureka" moment, (as I experience it) a big ripple of relations forms that deepens that understanding, right then. The entire process of dynamically forming a deeper understanding of something newly presented, from "playing out"/testing the ideas in your brain with little logic tests, comparisons, etc., doesn't seem to be possible. The test part does, but the runtime fine-tuning, augmentation, or whatever it would be, does not.
In my experience, if you do present something in the context window that is sparse in the training data, there's no depth to it at all, only what you tell it. And it will always creep towards/revert to the nearest statistically significant answers, with claims of understanding and zero demonstration of that understanding.
And I'm talking about relatively basic engineering-type problems here.
I think the only real problem left is having it automate its own post-training on the job, so it can learn to adapt its weights to the specific task at hand. Plus maybe long-term stability (so it can recover from "going crazy").
But I may easily be massively underestimating the difficulty. Though in any case I don't think it affects the timelines that much. (personal opinions obviously)
> are there modes of thinking that fundamentally require something other than what current LLM architectures do?
Possibly. There are likely also modes of thinking that fundamentally require something other than what current humans do.
Better questions are: are there any kinds of human thinking that cannot be expressed in a "predict the next token" language? Is there any kind of human thinking that maps onto the token-prediction pattern, but for which training a model would not be feasible regardless of training data and compute resources?
At the end of the day, the real-world value is utility, and some of their cognitive handicaps are likely addressable. Think of it like the evolution of flight by natural selection: flight had to be useful enough to make it worth adapting the whole body, so that flight became not just possible but useful and efficient. Sleep falls into this category too, imo.
We will likely see similar with AI. To compensate for some of their handicaps, we might adapt our processes or systems so the original problem can be solved automatically by the models.
There's an interesting historical angle here: Church's lambda calculus actually predates Turing machines by a few months (both 1936), and they're provably equivalent in computational power. Church even proved undecidability of the Entscheidungsproblem first, using lambda calculus.
Yet despite this head start, the Turing machine formalism became the dominant framework for CS theory—complexity classes, computability, formal verification. Whether that's path dependence, historical accident, or something deeper about how humans reason about computation, I'm not sure.
But it does make me wonder: if the imperative, state-based model proved more tractable even for theorists, maybe FP's learning curve isn't purely about unfamiliarity. There might be something genuinely harder about reasoning in terms of pure functions and recursion vs. "do X, then Y, update Z."
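As a toy illustration of the contrast (hypothetical example, nothing deep): the same computation written as a pure recursive definition versus the "do X, then Y, update Z" version.

```python
# Pure-functional style: no mutation; the result is defined by recursion.
def total(xs: list[int]) -> int:
    return 0 if not xs else xs[0] + total(xs[1:])

# Imperative style: explicit mutable state, updated step by step.
def total_loop(xs: list[int]) -> int:
    acc = 0
    for x in xs:
        acc += x    # "update Z" on each step
    return acc
```

Both compute the same thing; the question is which mental model stays tractable as the problem gets less trivial.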
Fully acknowledge this is handwavy—curious if others have thoughts.
1/ it took me a while to understand this is a social network. I thought it was about giving me more visibility into my sessions/token usage
2/ Unless I am missing something it should be pretty easy to game the system.
3/ I think that landing the user on a global feed of all users will help