A bad moment to have a make-or-break moment for your CPU business - a lot of customers will probably hold off purchases right now because of the RAM prices, no matter how good your CPU might be.
Isn't this new server CPU a drop in replacement though? So the DC could pull off the old CPU, drop in the new one and not touch the existing RAM setup, yet be able to deliver better performance within the limits of the existing RAM. Then once RAM prices drop (okay that might be a while) separately upgrade the RAM at a different time.
That's semi-dependent on supplier arrangements; i.e. lots of shops won't want to upgrade CPUs on a server out of fear that they can't get support later; sometimes that's justified by contract, sometimes it's not.
I have 2 thinkpads, and one of them is better in every aspect - except that the inferior one has it's 2 USB-C ports on opposite sides of the laptop, while the other one has both ports on the same side. Being able to plug in the charger from either side is really great, will definitely look for that in a future laptop.
I think there is a tendency to simply give in and buy bigger hardware if something doesn't work. With friends and family, I sometimes feel like having to talk them off the roof with regards to pulling the trigger on really expensive (relative to the tasks they're doing) hardware, simply because performance is often abysmal due to the fact that they trashed their OS with malware and bloatware and whatnot and can't understand all of that.
It's the same at work, to some degree. Our in-house ERP software performs like kicking a sack of rocks down a hill. I don't know how often I had to show devs that the hardware is actually idle and they're mostly derailing themselves with DB table locks, GC issues and whatnot. If I weren't pushing back, we probably would have bought the biggest VMs just to let them sit idle.
The blog is fine, it just looks like he didn't foresee that there would be a month where wouldn't post anything, so the navigation links break down. If you go to the last month he posted in, everything works as usual: http://blog.fefe.de/?mon=202505
> [...] I'm not even sure what exactly "Microsoft Copilot" entails anymore [...]
Watching from the sidelines (not a Microsoft user), I've completely lost track. Between this, the Azure 365 cloud whatever stuff, I have no idea what many of the products even exactly are any more.
Simply put Microsoft is the worst company at naming stuff. Even when they come up with a good name for something, they'll name 3 other totally different products the same thing to maximize confusion.
I gotta say though, I'm actually not sure which VMware (well Broadcom I suppose) products I use anymore. I'm pretty sure they took the Aria name off something else they called Aria for a little while. So Aria is no longer Aria but they still have Aria but it's what used to be called XYZ
Xbox Series with X > S (so if you want the high end of the current generation you want the Xbox Series X; if you want mid-range things are more complicated because you can now get an Xbox One X, but not the Xbox One, used for much less than you'd get an Xbox Series S for and which one is "better" is a dice roll depending on the games you want to play and if 4K matters to you…)
Series is a real weird word to use there. But it also doesn't help that the versions are extra complicated because with "PC-like compatibility" in everything after the Xbox One playing just about the entire same library you need a bit of a matrix to figure out which is best for you if you don't care about the "latest and greatest".
Oh wow yes, completely forgot about that one. To me, it's a complete blur made from single words and letters, one series x s one box 360? Maybe they should create a 365, with MS office pre-installed. Or something.
Seriously? Does anybody know what Copilot is? I don't think I have ever seem a "Copilot user", so I don't know what it looks like. Is it the little macro key on new laptop keyboards? The chatbot you get in Bing? A technical philosophy? Or is it in essence just copilot.com, the mediocre chat interface which you used to get free GPT-4 three years ago?
I wish. I got a Dell laptop for work and they've replaced the right Ctrl key with a Copilot key, and (because it's a locked-down work sysyem) the only thing I can remap that to is the Windows menu. And I keep hitting it out of muscle memory, interrupting everything. But at least now it doesn't launch Copilot.
Which I could add is "the only AI approved for use by IT" because they hate us.
> Which I could add is "the only AI approved for use by IT" because they hate us.
It's the same at our place. It's basically the lowest effort way as we already have data agreements with Microsoft 365 it eliminates a lot of the paperwork. And they do promise that they won't train on data even in the free (well, included with basic M365) version for corporate users. A lot of others don't unless you pay.
It's too bad because it seems to be the worst AI around. Even compared to ChatGPT itself which uses the same model as copilot in MS Office. I don't really understand why there's such a difference. If you do pay the $30 it's a bit better especially the researcher.
Double check if the (hopefully not locked) BIOS gives an option to customize the CTRL key. I had a previous work laptop which also got cute with the CTRL button, but thankfully did let you remap it.
I think it's highly circumstantial. For example, my personal servers run a lot of FreeBSD and even though I could stay on major releases for a rather long time, I usually upgrade almost as soon as new releases are available.
For servers at work, I tried running Fedora. The idea was that it would be easier to have small, frequent updates rather than large, infrequent updates.
Didn't work. App developers never had enough time to port their stuff to new releases of underpinning software, so we frequently had servers with unsupported OS version.
Gave up and switched to RockyLinux. We're in the process of upgrading the Rocky8-based stuff to Rocky9. Rocky9 was released 2022.
When an agent just plows ahead with a wrong interpretation or understanding of something, I like to ask them why they didn't stop to ask for clarification.
Just a few days ago, while refactoring minor stuff, I had an agent replace all sqlite-related code in that codebase with MariaDB-based code.
Asked why that happened, the answer was that there was a confusion about MariaDB vs. sqlite because the code in question is dealing with, among other things, MariaDB Docker containers. So the word MariaDB pops up a few times in code and comments.
I then asked if there is anything I could do to prevent misinterpretations from producing wild results like this.
So I got the advice to put an instruction in AGENTS.md that would urge agents to ask for clarification before proceeding.
But I didn't add it. Out of the 25 lines of my AGENTS.md, many are already variations of that.
The first three:
- Do not try to fill gaps in your knowledge with overzealous assumptions.
- When in doubt: Slow down, double-check context, and only touch what was explicitly asked for.
- If a task seems to require extra changes, pause and ask before proceeding.
If these are not enough to prevent stuff like that, I don't know what could.
Are agents actually capable of answering why they did things? An LLM can review the previous context, add your question about why it did something, and then use next token prediction to generate an answer. But is that answer actually why the agent did what it did?
It depends. If you have an LLM that uses reasoning the explanation for why decisions are made can often be found in the reasoning token output. So if the agent later has access to that context it could see why a decision was made.
The cursor-mirror skill and cursor_mirror.py script lets you search through and inschpekt all of your chat histories, all of the thinking bubbles and prompts, all of the context assembly, all of the tool and mcp calls and parameters, and analyze what it did, even after cursor has summarized and pruned and "forgotten" it -- it's all still there in the chat log and sqlite databases.
cursor-mirror skill and reverse engineered cursor schemas:
The German Toilet of AI
"The structure of the toilet reflects how a culture examines itself." — Slavoj Zizek
German toilets have a shelf. You can inspect what you've produced before flushing. French toilets rush everything away immediately. American toilets sit ambivalently between.
cursor-mirror is the German toilet of AI.
Most AI systems are French toilets — thoughts disappear instantly, no inspection possible. cursor-mirror provides hermeneutic self-examination: the ability to interpret and understand your own outputs.
What context was assembled?
What reasoning happened in thinking blocks?
What tools were called and why?
What files were read, written, modified?
This matters for:
Debugging — Why did it do that?
Learning — What patterns work?
Trust — Is this skill behaving as declared?
Optimization — What's eating my tokens?
See: Skill Ecosystem for how cursor-mirror enables skill curation.
>Žižek on toilets. Slavoj Žižek during an architecture congress in Pamplona, Spain.
>The German toilets, the old kind -- now they are disappearing, but you still find them. It's the opposite. The hole is in front, so that when you produce excrement, they are displayed in the back, they don't disappear in water. This is the German ritual, you know? Use it every morning. Sniff, inspect your shits for traces of illness. It's high Hermeneutic. I think the original meaning of Hermeneutic may be this.
>Hermeneutics (/ˌhɜːrməˈnjuːtɪks/)[1] is the theory and methodology of interpretation, especially the interpretation of biblical texts, wisdom literature, and philosophical texts. Hermeneutics is more than interpretive principles or methods we resort to when immediate comprehension fails. Rather, hermeneutics is the art of understanding and of making oneself understood.
----
Here's an example cursor-mirror analysis of an experiment with 23 runs with four agents playing several turns of Fluxx per run (1 run = 1 completion call), 1045+ events, 731 tool calls, 24 files created, 32 images generated, 24 custom Fluxx cards created:
Cursor Mirror Analysis: Amsterdam Fluxx Championship -- Deep comprehensive scan of the entire FAFO tournament development:
Just an update re German toilets: No toilet set up in the last 30 years (I know of) uses a shelf anymore. This reduces water usage by about 50% per flush.
LLMs often already "know" the answer starting from the first output token and then emulate "reasoning" so that it appeared as if it came to the conclusion through logic. There's a bunch of papers on this topic. At least it used to be the case a few months ago, not sure about the current SOTA models.
of course not, but it can often give a plausible answer, and it's possible that answer will actually happen to be correct - not because it did any - or is capable of any - introspection, but because it's token outputs in response to the question might semi-coincidentally be a token input that changes the future outputs in the same way.
Isn't that question a category error? The "why" the agent did that is that it was the token that best matched the probability distribution of the context and the most recent output (modulo a bit of randomness). The response to that question will, again, be the tokens that best match the probability distribution of the context (now including the "why?" question and the previous failed attempt).
if the agent can review its reasoning traces, which i think is often true in this era of 1M token context, then it may be able to provide a meaningful answer to the question.
Wait, no, that's the category error I'm talking about. Any answer other than "that was the most likely next token given the context" is untrue. It is not describing what actually happened.
I think this statement is on the same level as "a human cannot explain why they gave the answer they gave because they cannot actually introspect the chemical reactions in their brain." That is true, but a human often has an internal train of thought that preceded their ultimate answer, and it is interesting to know what that train of thought was.
In the same way, it is often quite instructive to know what the reasoning trace was that preceded an LLM's answer, without having to worry about what, mechanically, the LLM "understood" about the tokens, if this is even a meaningful question.
But it's not a reasoning trace. Models could produce one if they were designed to (an actual stack of the calls and the states of the tensors with each call, probably with a helpful lookup table for the tokens) but they specifically haven't been made to do that.
Unless things have changed drastically in the last 4 months (the last time I looked at it) those traces are not stored but reconstructed when asked. Which is still the same problem.
They aren't necessarily "stored" but they are part of the response content. They are referred to as reasoning or thinking blocks. The big 3 model makers all have this in their APIs, typically in an encrypted form.
Reconstruction of reasoning from scratch can happen in some legacy APIs like the OpenAI chat completions API, which doesn't support passing reasoning blocks around. They specifically recommend folks to use their newer esponses API to improve both accuracy and latency (reusing existing reasoning).
For a typical coding agent, there are intermediate tool call outputs and LLM commentary produced while it works on a task and passed to the LLM as context for follow up requests. (Hence the term agent: it is an LLM call in a loop.) You can easily see this with e.g. Claude Code, as it keeps track of how much space is left in the context and requires "context compaction" after the context gradually fills up over the course of a session.
In this regard, the reasoning trace of an agent is trivially accessible to clients, unlike the reasoning trace of an individual LLM API call; it's a higher level of abstraction. Indeed, I implemented an agent just the other day which took advantage of this. The OP that you originally replied to was discussing an agentic coding process, not an individual LLM API call.
Well, right, I see those reasoning stages in reasoning models with Ollama and if you ask it what its reasoning was after the fact what it says is different than what it said at the time.
I can't speak to your specific set up, but it sounds like you're halfway there if you can access the previous traces? All anyone can ask for is "show me the traces that led up to this point"; the "why did you do this" is a notational convenience for querying that data. If your set up isn't summarizing those traces correctly, then that sounds like a specific bug in the context or model quality, but the point is that the traces exist and are queryable in the first place, however you choose to do that.
(I am still primarily talking about agent traces, like the original OP, not internal reasoning blocks for a particular LLM call, though - which may or may not be available in context afterwards.)
In particular, asking "why" isn't a category error here, although there's only a meaningful answer if the model has access to the previous traces in its context, which is sometimes true and sometimes not.
There can be higher- and lower-level descriptions of the same phenomenon. when the kettle boils, it’s because the water molecules were heated by the electric element, but it’s also because I wanted a cup of tea.
If the reason the LLM retroactively invents for it's previous mistakes is still useful for getting the LLM to not make that kind of mistake again, then the distinction you're driving at doesn't matter.
> Any answer other than "that was the most likely next token given the context" is untrue.
"Because the matrix math resulted in the set of tokens that produced the output". "Because the machine code driving the hosting devices produced the output you saw". "Because the combination of silicon traces and charges on the chips at that exact moment resulted in the output". "Because my neurons fired in a particular order/combination".
I don't see how your statement is any more useful. If an LLM has access to reasoning traces it can realistically waddle down the CoT and figure out where it took a wrong turn.
Just like a human does with memories in context - does't mean that's the full story - your decision making is very subconscious and nonverbal - you might not be aware of it, but any reasoning you give to explain why you did something is bound to be an incomplete story, created by your brain to explain what happened based on what it knows - but there's hidden state it doesn't have access to. And yet we ask that question constantly.
If you want to be pedantic about it you could phrase it as follows.
When the LLM was in reasoning mode, in the reasoning context it often expressed statement X. Given that, and the relevance of statement X to the taken action. It seems likely that the presence of statement X in the context contributed to this action. Besides, the presence of statement X in the reasoning likely means that given the previous context embeddings of X are close to the context.
Hence we think that the action was taken due to statement X.
And that output could have come from an LLM introspecting it's own reasoning.
I don't think that phrasing things so pedanticaly is worth the extra precision though. Especially not for the statement that inspecting the reasoning logs of sn LLM can help give insight on why an LLM acted a certain way.
Just this morning I have run across an even narrower case of how AGENTS.md (in this case with GPT-5.3 Codex) can be completely ignored even if filled with explicit instructions.
I have a line there that says Codex should never use Node APIs where Bun APIs exist for the same thing. Routinely, Claude Code and now Codex would ignore this.
I just replaced that rule with a TypeScript-compiler-powered AST based deterministic rule. Now the agent can attempt to commit code with banned Node API usage and the pre-commit script will fail, so it is forced to get it right.
I've found myself migrating more and more of my AGENTS.md instructions to compiler-based checks like these - where possible. I feel as though this shouldn't be needed if the models were good, but it seems to be and I guess the deterministic nature of these checks is better than relying on the LLM's questionable respect of the rules.
We have pre-commit hooks to prevent people doing the wrong thing. We have all sorts of guardrails to help people.
And the “modern” approach when someone does something wrong is not to blame the person, but to ask “how did the system allow this mistake? What guardrails are missing?”
I really hate that the anthropomorphizing of these systems has successfully taken hold in people's brains. Asking it why it did something is completely useless because you aren't interrogating a person with a memory or a rationale, you’re querying a statistical model that is spitting out a justification for a past state it no longer occupies.
Even the "thinking" blocks in newer models are an illusion. There is no functional difference between the text in a thought block and the final answer. To the model, they are just more tokens in a linear sequence. It isn't "thinking" before it speaks, the "thought" is the speech.
Treating those thoughts as internal reflection of some kind is a category error. There is no "privileged" layer of reasoning happening in the silicon that then gets translated into the thought block. It’s a specialized output where the model is forced to show its work because that process of feeding its own generated strings back into its context window statistically increases the probability of a correct result. The chatbot providers just package this in a neat little window to make the model's "thinking" part of the gimmick.
I also wouldn't be surprised if asking it stuff like this was actually counter productive, but for this I'm going off vibes. The logic being that by asking that, you're poisoning the context, similar to how if you try generate an image by saying "It should not have a crocodile in the image", it will put a crocodile into the image. By asking it why it did something wrong, it'll treat that as the ground truth and all future generation will have that snippet in it, nudging the output in such a way that the wrong thing itself will influence it to keep doing the wrong thing more and more.
You're entirely correct in that it's a different model with every message, every token. There's no past memory for it to reference.
That said it can still be useful because you have a some weird behavior and 199k tokens of context, with no idea where the info is that's nudging it to do the weird thing.
In this case you can think of it less as "why did you do this?" And more "what references to doing this exist in this pile of files and instructions?"
Agreed. I wish more people understood the difference between tokens, embeddings, and latent space encodings. The actual "thinking" if you can call it that, happens in latent space. But many (even here on HN) believe the thinking tokens are the thoughts themselves. Silly meatbags!
Thinking happens in latent space, but the thinking trace is then the projection of that thinking onto tokens. Since autoregressive generation involves sampling a specific token and continuing the process, that sampling step is lossy.
However, it is a genuine question whether the literal meanings of thinking blocks are important over their less-observable latent meanings. The ultimate latent state attributable to the last-generated thinking token is some combination of the actual token (literal meaning) and recurrent thinking thus far. The latter does have some value; a 2024 paper (https://arxiv.org/abs/2404.15758) noted that simply adding dots to the output allowed some models to perform more latent computation resulting in higher-skill answers. However, since this is not a routine practice today I suspect that genuine "thinking" steps have higher value.
Ultimately, your thesis can be tested. Take the output of a reasoning model inclusive of thinking tokens, then re-generate answers with:
1. Different but semantically similar thinking steps (i.e. synonyms, summarization). That will test whether the model is encoding detailed information inside token latent space.
2. Meaningless thinking steps (dots or word salad), testing whether the model is performing detailed but latent computation, effectively ignoring the semantic context of
3. A semantically meaningful distraction (e.g. a thinking trace from a different question)
Look for where performance drops off the most. If between 0 (control) and 1, then the thinking step is really just a trace of some latent magic spell, so it's not meaningful. If between 1 and 2, then thinking traces serve a role approximately like a human's verbalized train of thought. If between 2 and 3 then the role is mixed, leading back to the 'magic spell' theory but without the 'verbal' component being important.
> I really hate that the anthropomorphizing of these systems has successfully taken hold in people's brains. Asking it why it did something is completely useless because you aren't interrogating a person with a memory or a rationale, you’re querying a statistical model that is spitting out a justification for a past state it no longer occupies.
"Thinking meat! You're asking me to believe in thinking meat!"
While next-token prediction based on matrix math is certainly a literal, mechanistic truth, it is not a useful framing in the same sense that "synapses fire causing people to do things" is not a useful framing for human behaviour.
The "theory of mind" for LLMs sounds a bit silly, but taken in moderation it's also a genuine scientific framework in the sense of the scientific method. It allows one to form hypothesis, run experiments that can potentially disprove the hypothesis, and ultimately make skillful counterfactual predictions.
> By asking it why it did something wrong, it'll treat that as the ground truth and all future generation will have that snippet in it, nudging the output in such a way that the wrong thing itself will influence it to keep doing the wrong thing more and more.
In my limited experience, this is not the right use of introspection. Instead, the idea is to interrogate the model's chain of reasoning to understand the origins of a mistake (the 'theory of mind'), then adjust agents.md / documentation so that the mistake is avoided for future sessions, which start from an otherwise blank slate.
I do agree, however, that the 'theory of mind' is very close to the more blatantly incorrect kind of misapprehension about LLMs, that since they sound humanlike they have long-term memory like humans. This is why LLM apologies are a useless sycophancy trap.
> Asking it why it did something is completely useless because you aren't interrogating a person with a memory or a rationale, you’re querying a statistical model that is spitting out a justification for a past state it no longer occupies.
Asking it why it did something isn’t useless, it just isn’t fullproof. If you really think it’s useless, you are way too heavily into binary thinking to be using AI.
I genuinely fail to see the usefulness, though, it seems counterproductive to me to do this kinda stuff. In my experience I just throw out the whole chat/session as soon as I notice it's starting to repeat mistakes/start doing stupid shit consistently, the few times I've tried interrogating it I could immediately tell all it was doing is, for lack of a better word, being a sycophant and aping my words back at me.
It hasn’t failed to be useful to me yet, even if it isn’t complete info about what went wrong. Better if you can ask it a specific question about what it did (why do you do X?). Sometimes it made a mistake and you can ask it how you can word instructions better to not make the mistake (useful in prompt engineering), sometimes I made an actual mistake and gave it conflicting instructions, sometimes it’s still something that can be fixed. Eventually it stops making mistakes because you’ve tested it enough and made your prompts robust. I guess your mileage will vary, but my experience is that it’s a conversation to get a good prompt, not a single one shot ask (which is why I save my prompts and reuse them).
It seems like LLMs in general still have a very hard time with the concepts of "doubt" and "uncertainty". In the early days this was very visible in the form of hallucinations, but it feels like they fixed that mostly by having better internal fact-checking. The underlying problem of treating assumptions as truth is still there, just hidden better.
LLMs are basically improv theater. If the agent starts out with a wildly wrong assumption it will try to stick to it and adapt it rather than starting over. It can only do "yes and", never "actually nevermind, let me try something else".
I once had an agent come up with what seemed like a pointlessly convoluted solution as it tried to fit its initial approach (likely sourced from framework documentation overemphasizing the importance of doing it "the <framework> way" when possible) to a problem for which it to me didn't really seem like a good fit. It kept reassuring me that this was the way to go and my concerns were invalid.
When I described the solution and the original problem to another agent running the same model, it would instantly dismiss it and point out the same concerns I had raised - and it would insist on those being deal breakers the same way the other agent had dimissed them as invalid.
In the past I've often found LLMs to be extremely opinionated while also flipping their positions on a dime once met with any doubt or resistance. It feels like I'm now seeing the opposite: the LLM just running with whatever it picked up first from the initial prompt and then being extremely stubborn and insisting on rationalizing its choice no matter how much time it wastes trying to make it work. It's sometimes better to start a conversation over than to try and steer it in the right direction at that point.
Gentoo is what really made Linux click for me, too. I'm still very, very glad for that and remain a loyal user to this day!
Although I've had to restrict it to the 2 desktop machines. Maybe I should give it a shot again on the laptops, now that binary packages are universally available...
reply