I'm shocked to see how poorly these models, which I find useful day to day, do in solving virtually any of the problems in Unlambda.
Before looking at the results my guess was that scores would be higher for Unlambda than any of the others, because humans that learn Scheme don't find it all that hard to learn about the lambda calculus and combinatory logic.
But the model that did the best, Qwen-235B, got virtually every problem wrong.
This surprises me too. I've experimented with using LLMs to convert lambda calculus expressions into combinatory logic. There is a simple deterministic way to do this, and LLMs claim to know it, and then they confidently fail.
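For reference, here is a minimal sketch of one standard deterministic method, bracket abstraction onto S, K and I. I am assuming that is the "simple deterministic way" meant here. The class and function names (`Var`, `Lam`, `App`, `Comb`, `abstract`, `compile_term`, `to_unlambda`) are made up for illustration, and the output is the naive, unoptimized translation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class Lam:
    param: str
    body: object


@dataclass(frozen=True)
class App:
    fn: object
    arg: object


@dataclass(frozen=True)
class Comb:
    name: str  # "S", "K" or "I"


S, K, I = Comb("S"), Comb("K"), Comb("I")


def occurs_free(x, t):
    """True if variable x occurs free in term t."""
    if isinstance(t, Var):
        return t.name == x
    if isinstance(t, App):
        return occurs_free(x, t.fn) or occurs_free(x, t.arg)
    if isinstance(t, Lam):
        return t.param != x and occurs_free(x, t.body)
    return False  # combinators contain no variables


def abstract(x, t):
    """Bracket abstraction [x]t, where t is already lambda-free."""
    if isinstance(t, Var) and t.name == x:
        return I                      # [x]x = I
    if not occurs_free(x, t):
        return App(K, t)              # [x]t = K t   (x not free in t)
    # Only an application can remain: [x](f a) = S ([x]f) ([x]a)
    return App(App(S, abstract(x, t.fn)), abstract(x, t.arg))


def compile_term(t):
    """Eliminate every lambda, innermost first."""
    if isinstance(t, Lam):
        return abstract(t.param, compile_term(t.body))
    if isinstance(t, App):
        return App(compile_term(t.fn), compile_term(t.arg))
    return t


def to_unlambda(t):
    """Print in Unlambda notation: backtick is application."""
    if isinstance(t, App):
        return "`" + to_unlambda(t.fn) + to_unlambda(t.arg)
    if isinstance(t, Comb):
        return t.name.lower()
    return t.name  # a free variable; not valid Unlambda on its own
```

For example, `to_unlambda(compile_term(Lam("x", Lam("y", Var("x")))))` gives S (K K) I written in Unlambda notation, which behaves like K when applied to two arguments.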
Probably because there's a ton of code that deals with nested parentheses across languages in the training data, and models have learned how to work around tokenization limitations when it comes to parentheses.
This is how I would deal with the problem if I maintained node: "Please, use your tokens and experimental energies to port to Rust and pass the following test suite. Let us know when you've got something that works."
Not only is it pushing production down, but the resulting high prices are almost certainly going to cause permanently lower demand in certain sectors and countries ("demand destruction").
I would love to see a complete accounting in a year or so.
Not necessarily. It's going to massively drive up the demand for coal and wood consumption where it used to be (comparatively) less-polluting gas. We won't really know until 6-12 months have passed and we've collected the data.
That can't be the whole story, right? Because there are an arbitrarily large number of (e.g.) Rust programs that will implement any given spec given in terms of unit tests, types, and perhaps some performance benchmarks.
But even accounting for all these "hard" constraints and metrics, there are clearly reasons to prefer some possible programs over others even when they all satisfy the same constraints and perform equally on all relevant metrics.
We do treat programs as efficient causes[1] of side effects in computing systems: a file is written, a block of memory is updated, etc. and the program is the cause of this.
But we also treat them as statements of a theory of the problem being solved[2]. And this latter treatment is often more important socially and economically. It is irrational to be indifferent to the theory of the problem the program expresses.
> there are clearly reasons to prefer some possible programs over others even when they all satisfy the same constraints
Maintainability is a big one missing from the current LLM/agentic workflow.
When business needs change, you need to be able to add on to the existing program.
We create feedback loops via tests to ensure programs behave according to the spec, but little to nothing in the way of code quality or maintainability.
Looks like a great implementation. I want to question the basic user story, which seems to be: "I am a software developer who wants to improve productivity by running multiple simultaneous agents that are roughly isomorphic to a human software developer team."
I am burning a lot of tokens every day at work and on personal projects. It's helpful. I generally work in tmux with GitHub Copilot in one pane, and a few other terminal panes showing tests and the current diff.
I find it really important to avoid the temptation to multi-task by running multiple agents. For quite varied tasks, productivity gains from multi-tasking have proven to be illusory. Why would it be different with writing software?
"exporting your own oil and gas to be able to have a 'clean' (and up to recently heavily subsidized) transportation network is in a way just a gigantic bookkeeping trick"
How so?
If every oil exporter used some of their oil revenue to switch to EVs, that would, all things equal, hasten the transition to EVs. The U.S. is not doing that.
> the drug dealer that knows you don't consume your own supply unless you must
So true. There's nothing incompatible at all with:
a) realizing that earth has gifted you with a valuable but limited & polluting energy source
b) realizing that you'd be foolish to get your own country hooked on it, but it's not a bad business if you can get other countries hooked on it.
Instead we get oil rich areas seemingly determined to show off how much of their oil they can waste.
Wow, so now the US oil barons who lobbied Trump to kill renewables and EVs are even worse than Mohammed "Bonesaw*" bin Salman Al Saud? That's really something, if you look at it that way...
Either you're too smart for me or I just can't follow you, but could you please expand a bit on your comment? I find it hard to link it to the parent, but I realize that may be on me.
Sorry, it was referring more to the grandparent comment, that referred to Saudi Arabia behaving more responsibly than the US, and Mohammed bin Salman is of course the crown prince and prime minister of Saudi Arabia.
They're comparing Saudi Arabia to a drug dealer; I don't think they're ascribing any moral virtue to the Saudi regime. They just believe the Saudis are acting more intelligently.
Yes? I don't think you can argue in good faith that the latter causes more total harm and damage than the former. It's really quite something to look at it in a different way.
The funny thing is the US doesn’t really consume much Saudi oil. The US is a net exporter of oil, though it does import some specific types of oil and export more of others.
The US’s interest in Middle East oil is a lot about stabilizing oil prices. At least it used to be when there was a rational policy and competent executors.
Transitioning to renewables makes economic sense for the Saudis because they make more money selling a barrel of oil as transportation fuel and generating their own power with wind and solar than they would by burning that barrel for electricity.
The US has vast reserves of coal and natural gas. We generally don't use oil to generate power either -- oil is something like 0.4% of the total power generated, because we have vast amounts of natural gas and coal to use instead.
The situation isn't the result of some crafty master plan on the part of the Saudis. It's just what makes sense.
The oil market is global and the US is a big part of that but it’s not the only one. You can always make changes to energy sources later and as new technologies are unlocked perhaps we can even skip some headaches now. Obviously there’s the geostrategic angle now which you see play out in Iran and Venezuela.
As other countries move to reliance on Chinese rare earth processing for renewable technology, it drives their oil and gas consumption down which means more oil and gas for those who are still using it.
If you really want to look at this analogy about drug dealers then really what you see is that America is the big boss here and an energy and military super power, and Saudi Arabia is just another dealer under American protection and if they don’t do what we tell them to do they’ll get the boot.
Like the drug dealers where I grew up they are making the neighborhood a really terrible place to live. They might have a nice house right now, but the homes around them are burning.
The US is moving the grid to renewables. The guys at the top might not think so and yell loudly not to, but they can't stop things, only put the brakes on a little.
They've pumped the brakes pretty hard by cutting EPA standards,
subsidizing coal,
suing to stop wind and solar projects,
cutting green energy grants by $8B,
yoinking solar tax credits,
trying to rewrite the Clean Air Act to block states from regulating emissions,
shielding Big Oil from litigation for climate deception,
and repeating Big Oil's lies and disinformation.
Those rollouts are seeing massive cutbacks from what I've read, as half the country is straight up banning new solar. Good luck ever getting that off the books.
I don't think it will be that hard. Banning solar is a feel-good thing now that doesn't affect many people, but that means that once the next election has passed it won't be opposed when lobbyists (and greens) try to roll it back. Of course each state is different, so in some it will take more than a few elections. In some states solar is already widespread enough that you can't ban it because too many people already have it and know enough about it to tell their friends. Those friends who live in other states will start to ask why they don't.
Remember you need to keep the 20 year plan in mind. If you only look to the end of 2026 things are hopeless, but look to 2050 (and compare to 2000) and things look much better.
As I said there, it's inherently something the LLM can't do, at least not without lots of engineering. So I'm assuming you're talking about "as a human" here.
Some of it is just trial and error. You notice it makes an incorrect assumption, it takes longer to find something than it should, and so on. Some of that can be predicted, simply by you knowing the codebase. If you sat down with a new hire to walk them through it and get them up to speed, what would you tell them? It'd be a waste of time to tell them about things they can easily figure out on their own within a minute by looking at filenames and so on. It's the low effort thing to do, but it also achieves nothing.
For example, "A's B component has a default C which should be overridden unless desired". If A is an internal library then you could just fix that if it goes against the LLM's common assumptions, but maybe it's an external dependency and it's not worth it.
Or maybe you're building a game, and there are a few core mechanics that are relevant to much of the logic. Then you can likely explain in a few sentences what would otherwise need hundreds of lines of code read across multiple files. So you put that in an AGENTS.MD file in a relevant folder so it gets autoloaded when touching any of that code.
"If every oil exporter used some of their oil revenue to switch to EVs, that would, all things equal, hasten the transition to EVs."
The premise is that all things aren't equal. The oil Norway would have used just gets used somewhere else, so what difference does it make what Norway does instead? I don't know if that's the reality of the situation, but if it is just an offset, it does sound like a bookkeeping trick, doesn't it?
Norway switching from ICEs to EVs objectively reduces global oil consumption+burning by exactly that much.
Norway exporting oil increases oil supply, but doesn't increase consumption. The world's oil consumers are not supply-constrained; the producers are not running at 100% capacity, and they'll happily pick up the slack if Norway just stopped exporting oil for no reason. And there's a large amount of consumption that can't be offset by electrification in the first place (petrochemicals, long distance flight, etc) so there's not even a theoretical future end-state where they require a non-EV-using counterparty to buy their oil to fund their EV usage.
Calling it a "bookkeeping trick" is just verbal sleight of hand.
"Norway switching from ICEs to EVs objectively reduces global oil consumption+burning by exactly that much."
Meaning what they are in fact doing has the same effect as if they stopped producing/exporting oil exactly to the extent that it gets replaced by EVs over there? I could only see that happening if they undersell everyone in the world so they create no new consumers. I guess the truth is somewhere in the middle. I imagine the truth can be known, though? When Norway enters the market, how much do other producers' sales go down?
This would be true, but you're not accounting for OPEC and other groups (e.g. historically the Texas Railroad Commission in the United States, not sure how relevant they still are) that balance production and price per barrel at a level they think is agreeable.
Oil hasn't been supply constrained since the '50s; its price is largely based on what producing countries agree on, as well as geopolitics.
Additionally, governments levy a decent amount of taxes on certain end products such as gasoline. They might very well, as they have in the past, decide to simply up their tax revenue as prices of crude and derivatives go down.
Only if Norway's lack of internal consumption must be met with equal and similarly destructive consumption elsewhere.
Consider if others followed their lead. Then oil would be used less for transportation, one of its most destructive and singular uses, and more for manufacturing or medical or less wasteful uses.
Metal: the bending and joining of metal for humans to use for something™, who find me through the interwebs, which I have been using since the dawn, off and on, clumsily, but since grade school. The Apple store was one room above a Chinese restaurant and had painted chipboard walls.
I have two web sites, one is a rental and I own the other, but I am focusing more and more on my core strengths in dealing with physical realities, which I sometimes call "applied geometry", though often there are curves and shapes that don't really have names.
But as a good deal of the work is designed and communicated about with the use of computers and phones, I also spend a lot of time thinking about how that could be better, so hanging out here, trying to fight the good fight, is part of most days.
According to that chart, 2021 was anomalously low and it has been linearly returning to normal for the past four years.
AFAICT, the general populace is anxious about AI. So, the news knows they can get clicks with “You are right to be afraid. AI bad.” Meanwhile, CEOs know they can get stock boosts by saying “We are so AI we don’t need expenses. Infinite ROI!”
Put together we’re getting a ton of scary reporting on what looks like a quite normal business cycle (at least as far as layoffs go). And, everyone being afraid to hire is the only thing actually making it self-fulfilling.
I wouldn’t call the massive levels of investment by both private equity and municipal/state governments “business as usual.” The sums being thrown down and/or promised are staggering. People/groups that lose are going to lose big.