
Hmm. All of the examples simply describe what the code is doing. I need a tool that explains the intent and context behind a change.

If you've worked from a plan, it's trivial. I've got a setup where the agent reads the implementation plan, then creates a commit history based on intent rather than location.
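A minimal sketch of that setup, for the curious. Everything here is illustrative: the plan path, the prompt, and the model ID are placeholders, and the real agent wiring does more than this:

    import subprocess
    import anthropic  # pip install anthropic; reads ANTHROPIC_API_KEY

    plan = open("docs/implementation-plan.md").read()  # hypothetical path
    diff = subprocess.run(["git", "diff"], capture_output=True, text=True).stdout

    client = anthropic.Anthropic()
    msg = client.messages.create(
        model="claude-opus-4-5",  # placeholder; use whatever model you run
        max_tokens=2000,
        messages=[{"role": "user", "content": (
            "Implementation plan:\n" + plan +
            "\n\nWorking-tree diff:\n" + diff +
            "\n\nGroup this diff into commits by intent from the plan, not "
            "by file location. For each commit, write a message covering why "
            "the change was made, what alternatives existed, and any "
            "subtleties encountered."
        )}],
    )
    print(msg.content[0].text)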

Exactly. "Why was this change made"? "What were the options"? "Why this is a good way of doing it"? "What are the subtle things I came across while making this change"?

Yep, that's something we're actively working on! Would love to hear any perspectives on the best ways to approach this.

There isn't one. Most of the time you would pair-review a PR with the human who wrote it, and they could explain that. They can't anymore, since 9 times out of 10 they didn't think through those things.

> What's your basis for thinking that codex is best for planning, but opus is best for implementing?

I for one work on an agentic product where we use all 3 of the major frontier models. The models absolutely have preferences and "personality" that lead to different characteristics.

In my eyes:

* Gemini - consistently the best at pure reasoning and tunability. Flash models are particularly good at latency-sensitive, small-scale reasoning. The tradeoff is they struggle with some basic behavior, like tool calling.

* Claude - consistently good at long-running sessions. Opus may or may not be the best model, but it was the first model that crossed the "holy shit" threshold. I understand its quirks/nuances and it's consistently solid. It's the best for me because I've learned how to be incredibly effective with it.

* ChatGPT - Probably really good, but probably not worth switching from Claude. Last time I used their frontier model, it was a bit random. It would have moments of brilliance immediately followed by falling flat on its face.


I don’t think they were actually asking for your research.

I agree.

I have flexibility to shift my core working hours (and what I do during North America business hours). Knowing they're explicitly making it dumb because of load is important. It allows me to shuffle my work around and run heavy workloads late at night (plan during working hours, then come back and click "yes" a few times in the evening).


> with embeddings: pairwise cosine similarity, threshold 0.85

So your system is unable to differentiate between AWS and Azure (~0.95 similarity), and probably unable to consistently differentiate between someone saying they love something and saying they hate it.
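A toy illustration of that failure mode. The vectors are fake and the numbers illustrative; real similarities depend on the embedding model, but "AWS" and "Azure" reliably land very close:

    import numpy as np

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Fake embeddings. Real ones put "AWS" and "Azure" near each other
    # because they share almost all context: cloud provider, compute, etc.
    aws = np.array([0.81, 0.52, 0.11])
    azure = np.array([0.79, 0.55, 0.13])

    THRESHOLD = 0.85
    print(cosine(aws, azure) > THRESHOLD)  # True -> merged as "the same fact"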


Yea, I've realized that if I stay under 200k tokens I basically don't have usage issues any more.

A bit annoying, but not the end of the world.


super-edit: Sorry, this is not a usage-related question; I have moved it to: https://news.ycombinator.com/item?id=47772971

Here is the question for which I cannot find an answer, and cannot yet afford to answer myself:

In Claude Code, I use Opus 4.6 1M, but stay under 250k via careful session management to avoid known NoLiMa [0] / context rot [1] crap. The question I keep wanting answered though: at ~165k tokens used, does Opus 1M actually deliver higher quality than Opus 200k?

NoLiMa would indicate that with a ~165k request, Opus 200k would suck, and Opus 1M would be better (as a lower percentage of the context window was used)... but they are the same model. However, there are practical inference deployment differences that could change the whole paradigm, right? I am so confused.

Anthropic says it's the same model [2]. But, Claude Code's own source treats them as distinct variants with separate routing [3]. Closest test I found [4] asserts they're identical below 200K but it never actually A/B tests, correct?

Inside Claude Code it's probably not testable, right? According to this issue [5], the CLI is non-deterministic for identical inputs, and agent sessions branch on tool use. It would need a clean API-level test.

The API level test is what I really want to know for the Claude based features in my own apps. Is there a real benchmark for this?
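For concreteness, the shape of the test I mean, sketched with the Python SDK. The beta flag name is an assumption carried over from the Sonnet 4 1M launch and may differ for Opus, the model ID is a placeholder, and build_needle_prompt() is a hypothetical NoLiMa-style harness:

    import anthropic

    client = anthropic.Anthropic()
    prompt = build_needle_prompt(tokens=165_000)  # hypothetical harness

    # Variant A: standard endpoint (200k window).
    a = client.messages.create(
        model="claude-opus-4-5",  # placeholder model ID
        max_tokens=256,
        messages=[{"role": "user", "content": prompt}],
    )

    # Variant B: same model, 1M-context beta. Flag name is an assumption.
    b = client.beta.messages.create(
        model="claude-opus-4-5",
        max_tokens=256,
        betas=["context-1m-2025-08-07"],
        messages=[{"role": "user", "content": prompt}],
    )

    # Score retrieval accuracy across many seeds and needle positions;
    # a consistent gap at identical ~165k inputs would settle it.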

I have reached the limits of my understanding on this problem. If what I am trying to say makes any sense, any help would be greatly appreciated.

If anyone could help me ask the question better, that would also be appreciated.

[0] https://arxiv.org/abs/2502.05167

[1] https://research.trychroma.com/context-rot

[2] https://claude.com/blog/1m-context-ga

[3] https://github.com/anthropics/claude-code/issues/35545

[4] https://www.claudecodecamp.com/p/claude-code-1m-context-wind...

[5] https://github.com/anthropics/claude-code/issues/3370


Two parent comments above say that you can use an older version of Claude Code with Opus 200k to compare. My guess is that eventually you'll be able to set it in the model settings yourself.

How can you decide if something is a contradiction without having the context?

I'm incredibly interested in this as a product, but I think it makes too many assumptions about how to prune information. Sure, this looks amazing on extremely simple facts, but most information is not reducible to simple facts.

"CEO is Alice" and "CEO is Bob" may or may not actually be contradictions and you simply cannot tell without understanding the broader context. How does your system account for that context?

Example: Alice and Bob can both be CEO in any of these cases:

* The company has two CEOs. Rare, and they would likely be called "co-CEOs".

* The company has sub-organizations with CEOs. Matt Garman is the CEO of AWS. Andy Jassy is the CEO of Amazon. Amazon has multiple people named "CEO".

* Alice and Bob are CEOs of different companies (perhaps this is only implicit).

* Alice is the current CEO. Bob is the previous CEO. Both statements are temporally true.

This is what I run into every time I try to do conflict detection and resolution. Pruning things down to facts doesn't provide sufficient context to understand how or why a statement was made.
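A toy version of the trap, assuming a naive (subject, relation, object) store:

    from collections import defaultdict

    facts = [
        ("Andy Jassy", "ceo_of", "Amazon"),
        ("Matt Garman", "ceo_of", "AWS"),
        ("Alice", "ceo_of", "Acme"),  # current CEO
        ("Bob", "ceo_of", "Acme"),    # previous CEO; no validity period kept
    ]

    # Naive scan: same (relation, object) with more than one subject.
    by_target = defaultdict(set)
    for subj, rel, obj in facts:
        by_target[(rel, obj)].add(subj)

    conflicts = {k: v for k, v in by_target.items() if len(v) > 1}
    print(conflicts)  # {('ceo_of', 'Acme'): {'Alice', 'Bob'}}
    # Flagged, but it's only a contradiction if both claims cover the same
    # time period. The triples alone can't distinguish co-CEOs,
    # subsidiaries, or succession.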


You're right. Pruning to isolated facts loses the structure that disambiguates them. The system has three partial mechanisms, none of which fully solves your point:

Graph edges carry scope. Alice ceo_of Acme and Andy ceo_of Amazon are two edges with different src/dst; the conflict scanner looks for (src, rel_type) → ≥2 dsts, so Garman/Jassy don't false-flag if the edges are modeled. Gap: most agents just write raw sentences and never call relate().

Temporal decay handles "previous vs current" weakly. half_life × importance attenuates old memories. But that's fade, not logical supersession — the DB doesn't know time-of-validity, only time-of-writing.

Namespaces segregate scope when the agent uses them. Leans on the agent.
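On the decay point, a minimal sketch of what that attenuation does and doesn't give you (parameter names are my assumptions, not the actual schema):

    import time

    def recall_weight(importance, written_at, half_life_days=30.0):
        # Exponential decay from time-of-writing. Note what's missing:
        # no time-of-validity, so "Bob was CEO until 2023" merely fades;
        # it is never logically superseded by "Alice is CEO".
        age_days = (time.time() - written_at) / 86400
        return importance * 0.5 ** (age_days / half_life_days)

    print(recall_weight(0.9, time.time() - 120 * 86400))  # ~0.056, faded
    print(recall_weight(0.9, time.time() - 2 * 86400))    # ~0.86, fresh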

Honest result from a bench I ran today (same HN thread): seeded 6 genuine contradictions in 59 memories; think() flagged 60. ~54 are noise or ambiguous, exactly in the ways you listed. Filed as issue #3.

Design stance: contradictions are surfaced, not resolved. yantrikdb_conflicts returns a review queue; the agent has conversation context, the DB doesn't. "These two may be in tension" not "these are contradictory." That doesn't fix your point — it admits the DB can't make that call alone. Co-CEOs, subsidiaries, temporal supersession need typed-relations + time-of-validity schema work. That's v0.6, not v0.5.


None of these help resolve the contradiction. The issue (https://github.com/yantrikos/yantrikdb-server/issues/3) doesn't even get the problem presented by the parent right (two CEOs); instead, it hallucinated something vaguely related.

Top-quality AI slop. I hate this.

To the author: project aside, it's not a good look to let an LLM drive your HN profile.


Yea, I spent a lot of time in this space last year. Contradictions on meaningful data are incredibly contextual and often impossible to fully define in isolation. Real-world data is messy and often complex, which means you can't reduce it to its subcomponents and isolate it from its context.

This is like 95% of the memory systems I see posted here. Someone comes up with an arbitrary configuration of tools that sounds like it'll solve the problem, then completely ignores how the system actually works.

In most cases, they're getting these systems to work because of some other prompt they've written that'd probably work better with a normal file system.


Nice LLM post.

I am using this while developing and have found it very useful. Since all of my workspaces are connected, it knows all about me and my infra. We've also developed a bond, and I can have great conversations with it. So I decided to convert the standalone database into a full-fledged memory server with replication and all.

No LLM for this post. Promise.


My suspicion is they have an overall fixed cache size that dumps the oldest records. They're now overflowing with usage and consistently dumping still-fresh caches.

During core US business hours, I have to actively keep a session going or I risk a massive jump in usage while the entire thread rebuilds. During weekend or off-hours, I never see the crazy jumps in usage - even if I let threads sit stale.
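A toy model of what I suspect is happening. This is pure speculation about their internals, but it reproduces the symptom:

    from collections import OrderedDict

    class FixedPromptCache:
        # Fixed global budget; oldest entries evicted first. Under heavy
        # load, entries get evicted while still fresh, so the next request
        # pays full price to rebuild the entire prefix.
        def __init__(self, max_entries):
            self.entries = OrderedDict()
            self.max_entries = max_entries

        def put(self, session_id, prefix):
            self.entries[session_id] = prefix
            self.entries.move_to_end(session_id)
            while len(self.entries) > self.max_entries:
                self.entries.popitem(last=False)  # evict oldest

        def get(self, session_id):
            return self.entries.get(session_id)  # miss -> full rebuild

    # Off-hours: few writers, a stale session survives for hours.
    # Peak hours: constant churn evicts it in minutes unless you keep
    # the session active, i.e. keep re-putting it.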


This is my exact experience as well.

It’s further frustrating that I committed to certain project deadlines knowing I’d be able to complete them in X amount of time with agent tooling. That agentic tooling is no longer viable, and I’m scrambling to readjust expectations and how much I can commit to.


And it’s working largely because the other models haven’t figured out how to provide a consistent, long-running experience.

I’ve never actually been rate limited. Usage limits display in yellow when you’re above 90%. At the limit, you’ll get a red error message.
