My favourite part of A Canticle for Leibowitz is the manual autoregressive model the old monk uses to recover damaged books. I remember reading the GPT-2 paper and thinking hang on...
As far as I can tell from scanning forums, to the extent humans contribute anything to the centaur setup, it is entirely in hardware provisioning and allocating enough server time before matches for the chess engines to do precomputation, rather than anything actually chess-related, but I am unsure on this point.
I have heard anecdotally from non-serious players (and therefore I cannot be certain that this reflects sentiment at the highest levels, although the ICCF results seem to back this up) that the only ways to lose in centaur chess at this point are to deviate from what the computer tells you to do, either intentionally or unintentionally by submitting the wrong move, or simply to be at a compute disadvantage.
I've got several previous comments on this because this is a topic that interests me a lot, but the two most topical here are the previous one and https://news.ycombinator.com/item?id=33022581.
The last public ranking of chess centaurs was in 2014, after which it is generally held to be meaningless, as the ranking of a centaur is just the same as the ranking of the engine. Magnus Carlsen’s peak Elo of 2882 is by far the highest any human has ever achieved. Stockfish 18 is estimated to be in excess of 4000 Elo. Which is to say, the difference between it and the strongest human player ever is about the same as the difference between a strong club player and a grandmaster. It’s not going to benefit meaningfully from anything a human player might bring to the partnership.
Magnus himself said in 2015 that we’ve known for a long time that engines are much stronger than humans, so the engine is not an opponent.
OK, but who is saying that to the llm? Another llm?
We got feedback in this thread from someone who supposedly knows Rust about common anti-patterns, and someone from the company came back with 'yeah that's a problem, we'll have agents fix it' [0].
Agents are obviously still too stupid to have the metacognition needed to decide when to refactor, even at $1,000 per day per person. So we still need the butts in seats. So we're back at the idea of centaurs. Then you have to make the case that paying an AI more than a programmer is worth it.[1]
[0] which has been my exact experience with multi-agent code bases I've burned money on.
[1] which in my experience isn't when you know how to edit text and send API requests from your text editor.
Building Attractor
Supply the following prompt to a modern coding agent
(Claude Code, Codex, OpenCode, Amp, Cursor, etc):
codeagent> Implement Attractor as described by
https://factory.strongdm.ai/
Canadian girlfriend coding is now a business model.
Amusingly, it appears the README (that would be code, right?) has hallucinated the existence of a docker image - someone filed an issue at https://github.com/strongdm/cxdb/issues/1
In-house employees don't read code or do code reviews, so presumably they don't raise issues either. I guess the issue was picked up by an astute HN reader.
I've looked at their code for a few minutes in a few files, and while I don't know what they're trying to do well enough to say for sure anything is definitely a bug, I've already spotted several things that seem likely to be, and several others that I'd class as anti-patterns in rust. Don't get me wrong, as an experiment this is really cool, but I do not think they've succeeded in getting the "dark factory" concept to work where every other prominent attempt has fallen short.
To pick a few (from the server crate, because that's where I looked):
- The StoreError type is stringly typed and generally badly thought out. Depending on what they actually want to do, they should either add more variants to StoreError for the different failure cases, replace the strings with sub-types (probably enums) to do the same, or write a type-erased error similar to (or wrapping) the ones provided by anyhow, eyre, etc., but with a status code attached. They definitely shouldn't be checking for substrings in their own error type for control flow (see the sketch at the end of this comment).
- So many calls to String::clone [0]. Several of the ones I saw were actually only necessary because the function took a parameter by reference even though it could have (and I would argue should have) taken it by value (If I had to guess, I'd say the agent first tried to do it without the clone, got an error, and implemented a local fix without considering the broader context).
- A lot of errors are just ignored with Result::unwrap_or_default or the like. Sometimes that's the right choice, but from what I can see they're allowing legitimate errors to pass silently. They also treat the values they get in the error case differently, rather than e.g. storing a Result or Option.
- Their HTTP handler has an 800-line closure which they immediately call, apparently as a substitute for the still unstable try_blocks feature. I would strongly recommend moving that into its own full function instead.
- Several ifs which should have been match.
- Lots of calls to Result::unwrap and Option::unwrap. IMO in production code you should always at minimum use expect instead, forcing you to explain what went wrong/why the Err/None case is impossible.
It wouldn't catch all/most of these (and from what I've seen might even induce some if agents continue to pursue the most local fix rather than removing the underlying cause), but I would strongly recommend turning on most of clippy's lints if you want to learn rust.
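To make the StoreError point concrete, here is roughly the shape I mean. A minimal sketch only: the variant names and status codes are my own guesses at the kind of structure, not anything taken from the CXDB source.

    // Illustrative only: variants and status codes are guesses, not CXDB's.
    use std::fmt;

    #[derive(Debug)]
    enum StoreError {
        NotFound { key: String },
        Conflict { key: String },
        Corrupt { detail: String },
        Io(std::io::Error),
    }

    impl StoreError {
        // Control flow keys off the variant, not off substring matching.
        fn status_code(&self) -> u16 {
            match self {
                StoreError::NotFound { .. } => 404,
                StoreError::Conflict { .. } => 409,
                StoreError::Corrupt { .. } | StoreError::Io(_) => 500,
            }
        }
    }

    impl fmt::Display for StoreError {
        fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
            match self {
                StoreError::NotFound { key } => write!(f, "key not found: {key}"),
                StoreError::Conflict { key } => write!(f, "write conflict on: {key}"),
                StoreError::Corrupt { detail } => write!(f, "corrupt record: {detail}"),
                StoreError::Io(e) => write!(f, "io error: {e}"),
            }
        }
    }

    impl std::error::Error for StoreError {
        fn source(&self) -> Option<&(dyn std::error::Error + 'static)> {
            match self {
                StoreError::Io(e) => Some(e),
                _ => None,
            }
        }
    }

    fn main() {
        let err = StoreError::NotFound { key: "turn/42".into() };
        // And the unwrap point: expect() documents why failure is impossible.
        let port: u16 = "8080".parse().expect("hard-coded literal is a valid port");
        println!("{err} -> HTTP {} (listening on {port})", err.status_code());
    }

The point isn't these exact variants; it's that the HTTP layer can then map errors to responses with a match instead of grepping its own error strings.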
This is great feedback, appreciate you taking the time to post it. I will set some agents loose on optimization / purification passes over CXDB and see which of these gaps they are able to discover and address.
We only chose to open source this over the past few days so it hasn't received the full potential of technical optimization and correction. Human expertise can currently beat the models in general, though the gap seems to be shrinking with each new provider release.
This is why I think AI-generated code is going nowhere. There are actual conceptual differences that the stochastic parrot cannot understand; it can only copy patterns. And there's no distinction between good and bad code (IRL) except for that understanding.
For those of us working on building factories, this is pretty obvious, because you immediately need shared context across agents / sessions and an improved ID + permissions system to keep track of who is doing what.
I was about to say the same thing! Yet another blog post with heaps of navel gazing and zero to actually show for it.
The worst part is they got simonw (perhaps unwittingly, or through social engineering) to vouch and stealth-market for them.
And $1000/day/engineer in token costs at current market rates? It's a bold strategy, Cotton.
But we all know what they're going for here. They want to make themselves look amazing to convince the boards of the Great Houses to acquire them. Because why else would investors invest in them and not in the Great Houses directly.
You don't see why a company would gain by inviting bloggers that will happily write positively about them? Talk about a conflict of interest; the FTC should ban companies from doing this.
We’ve been working on this since July, and we shared the techniques and principles that have been working for us because we thought others might find them useful. We’ve also open-sourced the nlspec so people can build their own versions of the software factory.
We’re not selling a product or service here. This also isn’t about positioning for an acquisition: we’ve already been in a definitive agreement to be acquired since last month.
It’s completely fair to have opinions and to not like what we’re putting out, but your comment reads as snarky without adding anything to the conversation.
Why will you be destitute? Consider this: how do billionaires make most of their money?
I’ll answer you: people buy their stuff.
What happens if nobody has jobs? Oh, that’s right! Nobody’s buying stuff.
Then what happens? Oh yeah! Billionaires get poorer.
There’s a very rational, self-interested reason sama has been running UBI pilots and Elon is also talking about UBI - the only way they keep more money flowing into their pockets is if the largest number of people have disposable income.
> What happens if nobody has jobs? Oh, that’s right! Nobody’s buying stuff.
> Then what happens? Oh yeah! Billionaires get poorer.
Or they pivot to businesses that don't depend on consumers buying stuff.
Or pivot away from business entirely, into a realm of pure power independent of the market and conventional economics.
> There’s a very rational, self-interested reason sama has been running UBI pilots and Elon is also talking about UBI - the only way they keep more money flowing into their pockets is if the largest number of people have disposable income.
There's another very rational, self-interested reason for those people to pursue UBI: as a temporary sop to the masses, to keep them passive until they lack the power to resist.
Can you give an example? His writing seems pretty grounded to me. He's not out there going on podcasts claiming that LLMs are going to cure cancer, afaik.
So I was on a webcast where people were talking about this. They are from https://docs.boundaryml.com/guide/introduction/what-is-baml and humanlayer.dev. Mostly they were talking about spec-driven development. Smart people. Here is what I understood from them about spec-driven development, which is not far from this AFAIU.
Let's start with the `/research -> /plan -> /implement` (RPI) loop. When you are building a complex system for teams you _need_ humans in the loop and you want them to focus on design decisions. Having structured workflows around agents provides a better UX for the humans making those design decisions. This is necessary for controlling drift, pollution of context and general mayhem in the code base. _This_ is the starting thesis of spec-driven development.
How many times have you, as a newbie, copied a slash command, pressed /research then /plan then /implement, only to find after several iterations that it is inconsistent and you have to go back and fix it? Many people still go back and forth with ChatGPT, copying their Jira docs in and out and answering people's questions on PRD documents. This is _not_ a defence; it is the user experience of working with AI for many.
One very understandable path to solve this is to _surface_ to the humans structured information extracted from your plan docs, for example:
In this very toy version of spec-driven development the idea is that each step in the RPI loop is broken down and made very deterministic with humans in the loop. This is a system designed by humans (Chief AI Officer, no kidding) for teams that follow fairly _customized_ processes for working fast with AI, without it turning into a giant pile of slop. And the whole point of reading code or QA is this: you stop the clock on development and take a beat to see the high-signal information. Testers want to read tests and QAers want to test behavior, because, well written, they can tell a lot about whether the software works. If you have ever written an integration test on brownfield code with poor test coverage, and made it dependable after several days in the dark, you know what it feels like... Taking that step out is what all the VCs say is the last game in town... the final game in town.
This StrongDM stuff is a step beyond what I can understand: "no humans should write code", "no humans should read code", really..? But the thing that puzzles me even more is that spec-driven development as I understand it is, to use borrowed words, like parents raising a kid: once you are a parent you want to raise your own kid, not someone else's. Because it's just such a human-in-the-loop process. Every company, tech or not, wants to make its own process that its engineers like to work with. So I am not sure they even have a product here...
This is the part that feels right to me because agents are idiots.
I built a tool that writes (non shit) reports from unstructured data to be used internally by analysts at a trading firm.
It cost between $500 and $5000 per day per seat to run.
It could have cost a lot more but latency matters in market reports in a way it doesn't for software. I imagine they are burning $1000 per day per seat because they can't afford more.
They are idiots, but getting better. Ex: wrote an agent skill to do some read only stuff on a container filesystem. Stupid I know, it’s like a maintainer script that can make recommendations, whatever.
Another skill called skill-improver, which tries to reduce skill token usage by finding deterministic patterns in another skill that can be scripted, and writes and packages the script.
Putting them together, the container-maintenance thingy improves itself every iteration, validated with automatic testing. It works perfectly about 3/4 of the time, another half of the time it kinda works, and fails spectacularly the rest.
It’s only going to get better, and this fit within my Max plan usage while coding other stuff.
LLMs are idiots and they will never get better because they have quadratic attention and a limited context window.
If the tokens that need to attend to each other are on opposite ends of the code base the only way to do that is by reading in the whole code base and hoping for the best.
If you're very lucky you can chunk the code base in such a way that the chunks pairwise fit in your context window and you can extract the relevant tokens hierarchically.
If you're not? Well, get reading, monkey.
Agents, md files, etc. are bandaids to hide this fact. They work great until they don't.
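To spell out what that hierarchical workaround looks like in practice, here is a rough sketch. llm_extract is a stub standing in for a real model call, and the names and the context budget are made up for illustration, not any particular tool's API.

    // Rough sketch of hierarchical extraction over a code base that does not
    // fit in one context window. `llm_extract` is a stand-in for a model call;
    // here it just truncates so the example runs.
    const CONTEXT_BUDGET: usize = 4_000; // characters, as a crude token proxy

    fn llm_extract(query: &str, text: &str) -> String {
        // Placeholder: a real implementation would send `query` + `text`
        // to a model and return the relevant excerpts.
        let _ = query;
        text.chars().take(200).collect()
    }

    fn extract_hierarchically(query: &str, files: &[String]) -> String {
        // Pass 1: extract from each chunk that fits the budget on its own.
        let mut summaries: Vec<String> = files
            .iter()
            .map(|f| llm_extract(query, f))
            .collect();

        // Pass 2..n: fold pairs of summaries together until everything
        // fits in a single window, then do one final extraction.
        while summaries.iter().map(|s| s.len()).sum::<usize>() > CONTEXT_BUDGET {
            let folded: Vec<String> = summaries
                .chunks(2)
                .map(|pair| llm_extract(query, &pair.join("\n")))
                .collect();
            summaries = folded;
        }
        llm_extract(query, &summaries.join("\n"))
    }

    fn main() {
        let files = vec!["fn a() {}".repeat(500), "fn b() {}".repeat(500)];
        let answer = extract_hierarchically("where is the retry logic?", &files);
        println!("{answer}");
    }

Which is exactly the problem: every fold is lossy, and whether the relevant tokens survive to the top depends entirely on how lucky your chunking was.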
Can you say more? Guile's the only scheme I've tried (attempts at packaging for Guix). Debugging has been difficult, but I figured it was me struggling with new tools and API. Does racket have better facilities for introspection or discovery at the REPL?
Sortition is the only system that ensures high quality universal education. If anyone can become president for a year then everyone needs to be able to be president for a year.
I would like to see sortition implemented in one house of a bicameral legislature. Executive office is not where I would want to see it tested first (and I think it’s ill suited even in theory).
I've been building systems like what the OP is using since gpt3 came out.
This is the honeymoon phase. You're learning the ins and outs of the specific model you're using and becoming more productive. It's magical. Nothing can stop you. Then you might not be improving as fast as you did at the start, but things are getting better every day. Or maybe every week. But it's heaps better than doing it by hand because you have so much mental capacity left.
Then a new release comes out. An arbitrary fraction of your hard-earned intuition is not only useless but actively harmful to getting good results with the new models. Worse, you will never know which part it is without unlearning everything you learned and starting over again.
I've had to learn the quirks of three generations of frontier families now. It's not worth the hassle. I've gone back to managing the context window in Emacs because I can't be bothered to learn how to deal with another model family that will be thrown out in six months. Copy and paste is the universal interface and being able to do surgery on the chat history is still better than whatever tooling is out there.
Unironically learning vim or Emacs and the standard Unix code tools is still the best thing you can do to level up your llm usage.
First off, appreciate you sharing your perspective. I just have a few questions.
> I've gone back to managing the context window in Emacs because I can't be bothered to learn how to deal with another model family that will be thrown out in six months.
Can you expand more on what you mean by that? I'm a bit of a noob on llm enabled dev work. Do you mean that you will kick off new sessions and provide a context that you manage yourself instead of relying on a longer running session to keep relevant information?
> Unironically learning vim or Emacs and the standard Unix code tools is still the best thing you can do to level up your llm usage.
I appreciate your insight but I'm failing to understand how exactly knowing these tools increases performance of llms. Is it because you can more precisely direct them via prompts?
LLMs work on text and nothing else. There isn't any magic there. Just a limited context window on which the model will keep predicting the next token until it decides that it has predicted enough and stops.
All the tooling is there to manage that context for you. It works, to a degree, then stops working. Your intuition is there to decide when it stops working. This intuition gets outdated with each new release of the frontier model and changes in the tooling.
The stateless API with a human deciding what to feed it is much more efficient in both cost and time as long as you're only running a single agent. I've yet to see anyone use multiple agents to generate code successfully (but I have used agent swarms for unstructured knowledge retrieval).
The Unix tools are there for you to progra-manually search and edit the code base and copy/paste into the context that you will send. Outside of Emacs (and possibly vim), with the ability to have dozens of ephemeral buffers open to modify their output, I don't imagine they will be very useful.
Or to quote the SICP lectures: The magic is that there is no magic.
I can't speak for parent, but I use gptel, and it sounds like they do as well. It has a number of features, but primarily it just gives you a chat buffer you can freely edit at any time. That gives you 100% control over the context, you just quickly remove the parts of the conversation where the LLM went off the rails and keep it clean. You can replace or compress the context so far any way you like.
While I also use LLMs in other ways, this is my core workflow. I quickly get frustrated when I can't _quickly_ modify the context.
If you have some mastery over your editor, you can just run commands and post relevant output and make suggested changes to get an agent like experience, at a speed not too different from having the agent call tools. But you retain 100% control over the context, and use a tiny fraction of the tokens OpenCode and other agents systems would use.
It's not the only or best way to use LLMs, but I find it incredibly powerful, and it certainly has its place.
A very nice positive effect I noticed personally is that as opposed to using agents, I actually retain an understanding of the code automatically, I don't have to go in and review the work, I review and adjust on the fly.
One thing to keep in mind is that the core of an LLM is basically a (non-deterministic) stateless function that takes text as input, and gives text as output.
The chat and session interfaces obscure this, making it look more stateful than it is. But they mainly just send the whole chat so far back to the LLM to get the next response. That's why the context window grows as a chat/session continues. It's also why the answers tend to get worse with longer context windows – you're giving the LLM a lot more to sift through.
You can manage the context window manually instead. You'll potentially lose some efficiencies from prompt caching, but you can also keep your requests much smaller and more relevant, likely spending fewer tokens.
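If it helps to see the shape of it, here is a bare-bones sketch of what a chat wrapper is doing under the hood. complete is a stub standing in for a real API call; none of these names come from an actual client library.

    // Bare-bones sketch: the "session" is just a growing transcript that
    // gets resent in full every turn. `complete` is a stub standing in
    // for a real model call.
    struct Message {
        role: &'static str, // "user" or "assistant"
        content: String,
    }

    fn complete(transcript: &[Message]) -> String {
        // Placeholder: a real client would serialize every role/content
        // pair and send the lot to the API, every single time.
        let last = transcript.last().map(|m| m.content.as_str()).unwrap_or("");
        format!("(assistant reply to: {last})")
    }

    fn main() {
        let mut transcript: Vec<Message> = Vec::new();

        for user_turn in ["explain this function", "now add a test"] {
            transcript.push(Message { role: "user", content: user_turn.to_string() });
            let reply = complete(&transcript); // the model sees the whole history
            transcript.push(Message { role: "assistant", content: reply });
        }

        // Manual context management is just editing this Vec before the
        // next call: drop dead ends, keep only the turns that still matter.
        transcript.retain(|m| !m.content.contains("dead end"));

        for m in &transcript {
            println!("{}: {}", m.role, m.content);
        }
    }

That retain line is the whole trick: whether you do it in a Vec, a gptel buffer, or by hand with copy/paste, you are just choosing what text the next stateless call gets to see.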
I'll wait for OP to move their workflow to Claude 7.0 and see if they still feel as bullish on AI tools.
People who are learning a new AI tool for the first time don't realize that they are just learning the quirks of the tool and the underlying model, not skills that generalize. It's not until you've done it a few times that you realize you've wasted more than 80% of your time on a model that is completely useless and will be sunset in 6 months.
5 was a great option for ML work last year since the colo we rented didn't come with a 10kW cable. With RAM, SSD and GPU prices the way they are now, I have no idea what you'd need to do.
Thank goodness we did all the capex before the OpenAI ram deal and expensive nvidia gpus were the worst we had to deal with.