OpenAI O3 breakthrough high score on ARC-AGI-PUB (arcprize.org)
1724 points by maurycy 27 days ago | 1755 comments



Efficiency is now key.

~$3,400 per single task to meet human performance on this benchmark is a lot. Also, it shows the bullets as "ARC-AGI-TUNED", which makes me think they did some undisclosed amount of fine-tuning (e.g. via the API they showed off last week), so even more compute went into this result.

We can compare this roughly to a human doing ARC-AGI puzzles, where a human will take (high variance in my subjective experience) between 5 seconds and 5 minutes to solve a task. (So I'd argue a human is at $0.03 - $1.67 per puzzle at $20/hr, and they quote an average mechanical turker at $2 per task in their document.)

Going the other direction: I am interpreting this result as human-level reasoning now costing approximately $41k/hr to $2.5M/hr with current compute.
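For anyone who wants to check the arithmetic, here is a minimal sketch, assuming the $20/hr wage, the 5 s - 5 min solve-time range, and the ~$3,400/task figure quoted above (these are the comment's assumptions, not official numbers):

    # Back-of-envelope: human cost per puzzle vs. the hourly rate implied by $3,400/task.
    WAGE_PER_HR = 20.0         # assumed human wage
    O3_COST_PER_TASK = 3400.0  # rough per-task cost quoted above

    for seconds in (5, 300):   # fastest and slowest human solve times
        human_cost = WAGE_PER_HR * seconds / 3600
        implied_hourly = O3_COST_PER_TASK * 3600 / seconds
        print(f"{seconds:>3}s/puzzle: human ~${human_cost:.2f}, o3 ~${implied_hourly:,.0f}/hr")

    # ->   5s/puzzle: human ~$0.03, o3 ~$2,448,000/hr
    # -> 300s/puzzle: human ~$1.67, o3 ~$40,800/hr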

Super exciting that OpenAI pushed the compute out this far so we could see the O-series scaling continue and intersect humans on ARC; now we get to work towards making this economical!


Some other important quotes: "Average human off the street: 70-80%. STEM college grad: >95%. Panel of 10 random humans: 99-100%" -@fchollet on X

So, considering that the $3,400/task system isn't able to compete with a STEM college grad yet, we still have some room (but it is shrinking; I expect even more compute will be thrown at this and we'll see these barriers broken in the coming years).

Also, some other back of envelope calculations:

The gap in cost is roughly 10^3 between O3 High and avg. mechanical turkers (humans). Pure GPU cost improvement alone (~doubling every 2-2.5 years) puts us at 20-25 years.
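A quick sketch of that projection, assuming a clean 10^3 cost gap and the 2-2.5 year doubling time for GPU price-performance stated above:

    import math

    GAP = 1_000                 # assumed o3-high vs. mechanical-turk cost ratio
    doublings = math.log2(GAP)  # ~10 cost halvings needed

    for years_per_doubling in (2.0, 2.5):
        print(f"doubling every {years_per_doubling} yrs -> ~{doublings * years_per_doubling:.0f} years")

    # -> ~20 years and ~25 years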

The question now is: can we close this "to human" gap (10^3) quickly with algorithms, or are we stuck waiting 20-25 years for GPU improvements? (I think the answer feels obvious: this is new technology, things are moving fast, and the chance for algorithmic innovation here is high!)

I also personally think that we need to adjust our efficiency priors and start looking not at "humans" as the bar to beat, but at theoretical computable limits (which show much larger gaps, ~10^9-10^15, for modest problems). Though it may simply be the case that tool/code use + AGI at near-human cost covers a lot of that gap.


It's also worth keeping in mind that AIs are a lot less risky to deploy for businesses than humans.

You can scale them up and down at any time, they can work 24/7 (including holidays) with no overtime pay and no breaks, they need no corporate campuses, office space, HR personnel or travel budgets, you don't have to worry about key employees going on sick/maternity leave or taking time off the moment they're needed most, they won't assault a coworker, sue for discrimination or secretly turn out to be a pedophile and tarnish the reputation of your company, they won't leak internal documents to the press or rage quit because of new company policies, they won't even stop working when a pandemic stops most of the world from running.


I get the excitement, but folks, this is a model that excels only in things like software engineering/math. They basically used reinforcement learning to train the model to better remember which pattern to use to solve specific problems. This in no way generalises to open ended tasks in a way that makes human in the loop unnecessary. This basically makes assistants better (as soon as they figure out how to make it cheaper), but I wouldn't blindly trust the output of o3. Sam Altman is still wrong: https://www.lycee.ai/blog/why-sam-altman-is-wrong


In your blog you say:

> deep learning doesn't allow models to generalize properly to out-of-distribution data—and that is precisely what we need to build artificial general intelligence.

I think even (or especially) people like Altman accept this as a fact. I do. Hassabis has been saying this for years.

The foundational models are just a foundation. Now start building the AGI superstructure.

And this is also where most of the still-human intellectual energy is now.


You lost me at the end there.

These statistical models don’t generalize well to out of distribution data. If you accept that as a fact, then you must accept that these statistical models are not the path to AGI.


Quite. And if it was right, those businesses deploying it and replacing humans need humans with jobs and money to pay for their products and services…


It will just keep bleeding the middle class on and on, till the point where either everyone is rich, homeless or a plumber or other such licensed worker. And then there will be such a glut in the latter (shrinking) market, that everyone in that group also becomes either rich or homeless.


Productivity gains increase the standard of living for everyone. Products and services become cheaper. Leisure time increases. Scarce labor resources can be applied in other areas.

I fail to see the difference between AI-employment-doom and other flavors of Luddism.


It also fuels income inequality with a fatter pipe in every iteration. You get richer as you move up in the supply chain, period. Companies vertically integrate to drive costs down in the long run.

As AI gets more prevalent, it'll drive the cost down for the companies supplying these services, so the former employees of said companies will be paid lower, or not at all.

So, tell me, how will paying fewer people less money drive their standard of living upwards? I can understand the leisure time. Because, when you don't have a job, all day is leisure time. But you'll need money for that, so will these companies fund the masses via government-provided Universal Basic Income, so that these people can live a borderline miserable life while still funding the companies that squeeze them for more and more?


> It also fuels income inequality with a fatter pipe in every iteration

Who cares? A rising tide lifts all boats. The wealthy people I know all have one thing in common: they focused more on their own bank accounts than on other people's.

> So, tell me, how will paying fewer people less money drive their standard of living upwards?

Money is how we allocate limited resources. It will become less important as resources become less limited, less necessary, or (hopefully) both.


> Money is how we allocate limited resources. It will become less important as resources become less limited, less necessary, or (hopefully) both.

Money is also how we exert power and leverage over others. As inequality increases, it enables the ever wealthier minority to exert power and therefore control over the majority.


If that's a problem, why does the progressive point of view typically argue in favor of giving more power over our lives to the ruling class?

The problem isn't the money. The problem is the power.


> why does the progressive point of view typically argue in favor of giving more power over our lives to the ruling class?

Humans are interesting creatures. Many of them do not have a conscience and don't understand the notion of ethics and "not doing something because it's wrong to begin with". From my experience, especially people in the US think "if it's not illegal, then I can and will do this", which is wrong on many levels.

Many European people are similar, but bigger governments and harsher justice systems make them more orderly, and happier in general. Yes, they can't carry guns, but they don't need them to begin with. Yes, they can't own Cybertrucks, but they can walk or use an actually working mass transportation system to begin with.

Plus, proper governments have checks and balances. A government can't rip people off for services the way corporations do, most of the time. Many of the things Americans are afraid of (social health services for everyone) make life more just and tolerable for all parts of the population.

Big government is not a bad thing; uncontrollable government is. We're entering the era of "corporate pleasing uncontrollable governments", and this will be fun in a tragic way.


"Many European people are similar, but bigger governments and harsher justice system makes them more orderly, and happier in general. Yes, they can't carry guns, but they don't need to begin with. Yes, they can't own Cybertrucks, but they can walk or use an actually working mass transportation system to begin with."

This comment is a festival of imprecise stereotypes.

Gun laws vary widely across Europe, as does public safety (both the real thing and the perception of it; if you only avoid extra rapes because women don't venture outside after dark, the city isn't really safe), as does the overall level of personal happiness, as does the functionality of public transport systems.

And the quality of public services doesn't really track the size of the government even in Europe that well. Corruption eats a lot of the common pie.


> This comment is a festival of imprecise stereotypes.

I might be overgeneralizing, but I won't accept the "festival of imprecise stereotypes" claim. This is what I got from working with many people from many countries in Europe for close to two decades. I travel at least twice a year, and basically live with them for short periods of time. So this is not from reading some questionable social media sites and being an armchair sociologist.

> Gun laws vary widely across Europe...

Yet the USA has 3x the armed homicide rate of its closest follower in the developed world, and the USA is the "leader" of the pack. 24 something vs. 8 something.

> as does public safety

Every city has safe and unsafe areas. Even your apartment has unsafe areas.

> as does the overall level of personal happiness, as does the functionality of public transport systems.

Of course, but even if DB has a two-hour delay because of a hold-up at the Swiss border, I can board a Eurostar and casually see another country for peanuts. Happiness changes due to a plethora of reasons, like Swedes' short daylight hours in winter, or an economic downturn elsewhere.

> And the quality of public services doesn't really track the size of the government even in Europe that well. Corruption eats a lot of the common pie.

Sadly, corruption in Europe is on the rise when compared to the last decade. I can see that. However, at least many countries have working social security systems, the NHS not being one of them, sadly.


Lmao attacking stereotypes with stereotypes.

Please, what cities? You are just making up rape stats. That makes you the bigger idiot here.

Ohh yeah, so much corruption, yet I literally enjoy Zagreb more than any US city I have been to, and it's not even special. Because if this is just going to be the shittiest argument ever, there's my anecdotal rebuttal.


> We're entering the era of "corporate pleasing uncontrollable governments", and this will be fun in a tragic way.

Right, so the answer is not to make that bad government bigger, the answer is to replace it with a good government. Feeding a cancer tumor doesn't make it better.


> Right, so the answer is not to make that bad government bigger, the answer is to replace it with a good government.

Bad government (where by bad I mean serving the interests of the wealthy few over the masses) is bad regardless of its size.

If you believe in supply-side/trickle-down economics, you might use the opposite definition of "bad", in which case the shrinking of a government that restrains corporations (protecting the masses) through regulation, or that pays for seniors not to end up in total destitution (Social Security/Medicare), would count as "good".

The size of the government is less relevant than what it is doing, and whether you agree with that.


Trickle down economics isn't something you either "believe in" or "don't believe in". It's a disproven theory that does not work.


It's not always true that progressives are for more government power. See the death penalty, for example. It's pretty much the ultimate power a government could have, and who advocates for it? Not progressives, I believe.


You think the death penalty is an exercise of ultimate power? It’s more an exercise of vengeance.

The ultimate exercise of government power is keeping someone locked in a tiny cell for the rest of their life where their bed is next to their toilet and you make them beg a faceless bureaucracy that has no accountability annually for some form of clemency via parole, all while the world and their family moves on without them.


I don't necessarily agree with that, but even if it's true, I think my main point still stands about who is likely to support either thing.


In a free democracy, I think progressives see the ruling class as those in a position to influence democratic rule with an outsized influence compared to 1 person, 1 vote. And those are the people with money, or with too much centralized media power or popularity.

The employees of the government and those elected are not seen as the ruling class by progressives, but just normal people that have the qualifications and are employed to manage the government on behalf of the people.

It's important therefore that those elected and put in charge of the government are in a position where they don't have the power to benefit themselves or their friends/family, but are in a position where they can wield power to benefit the people who hired them for the job (their constituents), and that if they fail to do so, they can get replaced.


To be blunt: It doesn't.

The modern political binary was originally constructed in the ashes of the French Revolution, as the ruling royalty, nobility and aristocracy recoiled in horror at the threat that masses of angry poor people now posed. The left wing thrived on personal liberty, tearing down hierarchies, pursuing "liberty, equality, fraternity". The right wing perceives social hierarchy as a foundational good, sees equality as anarchy and order (and respect for property) as far more important than freedom. For a century they experienced collective flashbacks to French Revolutionaries burning fine art for firewood in an occupied chateau.

Notably, it has not been a straight line of Social Progress, nor a simple Hegelian dialectic, but a turbulent winding path between different forces of history that have left us with less or more personal liberty in various eras. But... well... you have to be very confused about what they actually believe now or historically to understand progressives or leftists as tyrants who demand hierarchy.

That confusion may come from listening to misinformation from anticommunists, a particular breed of conservative who for the past half century have asserted that ANY attempt to improve government or enhance equality was a Communist plot by people who earnestly wanted Soviet style rule. One of those anticommunists with a business empire, Charles Koch, funded basically all the institutions of the 'libertarian' movement, and later on much of the current GOP's brand of conservatism.


> If that's a problem, why does the progressive point of view typically argue in favor of giving more power over our lives to the ruling class?

You've literally reversed the meaning of the term "progressive" by replacing it with the meaning of the term "oligarchic".

Progressives argue for less invasion by government in our personal lives, and less unequal distribution of wealth and power. They are specifically opposed to power being delivered to a ruling class.

> The problem isn't the money. The problem is the power

These are nearly inseparable in current (and frankly most past) societies. Pretending that they are not is a way of avoiding practical solutions to the problem of the distribution of power.


I really get the feeling that people do not understand that progressive is almost a synonym for Libleft.

Those damn authoritarians, stripping the power from the oligarchs by massively taxing the rich and defunding the police. The bastards.


Progressive is more precise than that. There are specific policies you can examine from the Progressive Era.

Ultimately it was the 'oligarchs' who argued in favor of the progressive agenda. Wall Street created the 3rd central bank, the US Federal Reserve. The AMA closed all of the mutual aid societies and their hospitals. Railroad barons lobbied for subsidies and price controls to eliminate their competitors. Woodrow Wilson declared war on Germany, "The world must be made safe for democracy"

Of course, no would-be authoritarian claims more power without claiming that they are doing it for the greater good or to attack the rich classes. The Progressive Era was a smorgasbord of special-interest handouts and grants to cartels. All of this was lobbied for by oligarchs.


Yeah, that commenter wildly misunderstands what "progressive" means. Like full on got the definition of the word backwards.

Is this common? People think "progressive" means "complete government control"?

Progressives support regulations to prevent both public and private entities from becoming too powerful. It's not like they want to give the government authoritarian control lol.


I guess it depends on what you're defining as "the ruling class", because I believe most progressives would define it as "the wealthy" and would certainly not be in favor of that. Look at the AOC/Pelosi rift, for instance.


Politicians are a part of the ruling class for any sensible definition of the word.


In free democracies, politicians are elected representatives, not rulers. They are accountable to voters through regular elections. Power is distributed across multiple branches/institutions. Citizens have protected rights and freedoms. Politicians can be voted out or recalled. Laws apply equally to politicians and citizens.

In practice, there's always a slippery slope: can wealthy people integrate themselves into that power structure (lobbying, media control), how strong are the checks and balances, what is the level of corruption/transparency, etc. But when that slips, we stop calling it a free democracy, and it becomes an oligarchy, or a plutocracy, or an illiberal democracy.


Politicians do come in different flavours. There are some elected officials with good intentions. See again the AOC/Pelosi rift.

The more we regulate to get money out of politics, the more good people will have a shot at being elected.

These are all common progressive values. No true progressive supports wealthy unethical politicians gaining more power. Anyone telling you so is not speaking in good faith, or they are misinformed.


Why would "resources" become less limited or necessary just because there's some AGI controlled by a few people? You're assuming a lot here.

Separately, is it "rising tide lifts all boats" or "pull yourself up by your bootstraps" that drives the common person's progress? You seem confused which metaphor to apply while handwaving the discussion away.


Why would "resources" become less limited or necessary just because there's some AGI controlled by a few people? You're assuming a lot here.

The Luddites asked a similar question. The ultimate answer is that it doesn't matter that much who controls the means of production, as long as we have access to its fruits.

As long as manual labor is in the loop, the limits to productivity are fixed. Machines scale, humans don't. It doesn't matter whether you're talking about a cotton gin or a warehouse full of GPUs.

> Separately, is it "rising tide lifts all boats" or "pull yourself up by your bootstraps" that drives the common person's progress? You seem confused which metaphor to apply while handwaving the discussion away.

I haven't invoked the "bootstrap" cliché here, have I? Just the boat thing. They make very different points.

Anyway, never mind the bootstraps: where'd you get the boots? Is there a shortage of boots?

There once was a shortage of boots, it's safe to say, but automation fixed that. Humans didn't, and couldn't, but machines did. Or more properly, humans building and using machines did.


> The ultimate answer is that it doesn't matter that much who controls the means of production, as long as we have access to its fruits.

That mattered a lot in communist places; we saw it fail. Same thing with most authoritarian regimes today: it's a crap shoot. You simply can't entrust a small group with full control of the means of production and expect them to make it efficient, cheap, innovative, sustainable and affordable.


> Who cares? A rising tide lifts all boats.

Apparently people who are not wealthy enough to buy a boat and afraid of drowning care about this a lot. Also, for whom does the tide rise? Not for the data workers who label data for these systems for peanuts, or the people who lose their jobs because they can be replaced with AI, or the Amazon drivers who are auto-fired by their in-car HAL9000 units, which label behavior however they see fit.

> The wealthy people I know all have one thing in common: they focused more on their own bank accounts than on other people's.

So, the amount of money they have is much more important than everything else. That's greed, not wealth, but OK. I'm not feeling like dying on the hill of greedy people today.

> Money is how we allocate limited resources.

...and the wealthy people (you or I or others know) are accumulating amounts of it which they can't make good use of personally, I will argue.

> It will become less important as resources become less limited, less necessary, or (hopefully) both.

How will we make resources less limited? Recycling? Reducing population? Creating out of thin air?

Or, how will they become less necessary? Did we invent materials which are more durable and cheaper to produce, and do we start to sell it to people for less? I don't think so.

See, this is not a developing country problem. It's a developed country problem. Stellantis is selling inferior products for more money, while reducing its workforce, closing factories, and replacing metal parts with plastics, and the CEO is taking $40MM as a bonus [0], and now he has apparently resigned after all those shenanigans.

So, no. Nobody is making things cheaper for people. Everybody is after the money to raise their own tides.

So, you're delusional. Nobody is thinking about your bank account, that's true. This is why resources won't become less limited or less necessary: because all the surplus is accumulating with people who are focused on their own bank accounts more than anything else.


> How will we make resources less limited? Recycling? Reducing population? Creating out of thin air?

We've already done it, as evidenced by the fact that you had the time and tools to write that screed. Your parents probably didn't, and your grandparents certainly didn't.


No, it doesn't prove anything. To be brutally honest, I have just eaten a meal, and have 30 minutes of relax time. Then I'll close this 10 year old laptop and continue what I need to do.

No, my parents had that. Instead, they were chatting on the phone. My grandparents already had that too. They just chatted at the hall in front of the house with their neighbors.

We don't have time. We are just deluding ourselves. While our lives are technologically better, and we live longer, our lives are not objectively healthier and happier.

Heck, my colleagues join teleconferences from home with their kid's voice at the background and drying clothes visible, only hidden by the Gaussian blur or fake background provided by the software.

How do they have more time to do more things? They still work 8 hours a day, doing the occasional overtime.

Things have changed and evolved, but evolution and change doesn't always bring progress. We have progressed in other areas, but justice, life conditions and wealth are not in this list. I certainly can't buy a house just because I want one like my grandparents did, for example.


Why wouldn't the people at the top siphon off literally 100% of the benefit whilst the people displaced bear 100% of the cost?


Just to clarify: the Luddites were being automated out of a job.

From what I understand of history, while industrial revolutions have generally increased living standards and employment in the long term, they have also caused massive unemployment/starvation in the short term. In the case of textiles, I seem to recall that it took ~40 years for employment to return to its previous level.

I don't know about you guys, but I'm far from certain that I can survive 40 years without a job.


In addition, although the Luddite uprisings were themselves crushed, the political elite were not blind to the circumstances that led to them, and did eventually bring in the legislation that introduced modern workers rights, legalized unions and sowed the seeds of the modern secular welfare state in Britain. That is a pattern that appears throughout history and especially in Britain, where the government cannot be seen to yield to violent protest but quietly does so anyway.


And among the few who found a job back, most of the time it was some coal mining job, to feed the machines who replaced them... Maybe the future of (some of) nowadays' office workers is to feed (train) the models replacing them?


I cannot find a place industrial revolutions caused massive starvation. Care to provide one?

The other things you state are not even close.

First, lowered employment for X years does not imply one cannot get a job in X years - that's simply fear mongering. Unemployment over that period seems to have fluctuated very little, and massive external economic issues were causes (wars with Napoleon, the US, changing international fortunes), not Luddites.

Next, there was inflation and unemployment during the TWO years surrounding the Luddites, in 1810-1812 (starting right before the Luddite movement) due to wars with Napoleon and the US [1]. Somehow attributing this to tech increases or Luddites is numerology of the worst sort.

If you look at academic literature about the economy of the era, such as [2] (read on scihub if you must), you'll find there was incredible population growth, and that wages grew even faster. While many academics at the time thought all this automation would displace workers, those academics were forced to admit they were wrong. There's plenty of literature on this. Simply dig through Google Scholar.

As to starvation in this case, I can find no "massive starvation". [3], for example, points out that "Among the industrial and mining families, around 18 per cent of writers recollected having experienced hunger. In the agricultural families this figure was more than twice as large — 42 per cent".

So yes there was hunger, as there always had been, but it quickly reduced due to the industrial revolution and benefited those working in industry more quickly than those not in industry.

[1] https://en.wikipedia.org/wiki/Luddite#:~:text=The%20movement....

[2] https://www.jstor.org/stable/2599511

[3] https://academic.oup.com/past/article/239/1/71/4794719


Thanks for your response.

My bad for "massive starvation", that's clearly a mistake, I meant to write something along the lines of "massive unemployment – and sometimes starvation". Sadly, too late to amend.

Now, I'll admit that I don't have my statistics at hand. I quoted them from memory from, if I recall correctly, _Good Economics for Hard Times_. I'm nearly certain about the ~40 years, but it's entirely possible that I confused several parts of the industrial revolution. I'll double-check when I have an opportunity.


Leisure time hasn’t increased in the last 100 years except for the lower income class which doesn’t have steady employment. But yes, I see your point that the homeless person who might have had a home if he had a (now automated) factory job should surely feel good about having a phone that only the ultra rich had 40 years ago.


It's not worth tossing away in sarcasm.

The availability of cheaply priced smartphones and cellular data plans has absolutely made being homeless suck less.

As you noted though, a home would probably be a preferable alternative.


> As you noted though, a home would probably be a preferable alternative.

The problem is that the preferable option (housing) won't happen because unlike a smartphone, it requires that land be effectively distributed more broadly (through building housing) in areas where people desire to live. Look at the uproar by the VC guys in Menlo Park when the government tried to pursue greater housing density in their wealthy hamlet.

It also requires infrastructure investment which, while it has returns for society at large, doesn't have good returns for investors. Only government makes those kinds of investments.

Better to build a wall around the desirable places, hire a few poorer-than-you folks as security guards, and give the other people outside your wall ... cheap smartphones to sate themselves.


A wall isn't necessary; you just need the police, security guards and legislation to chase out the homeless / make them miserable.


Indeed, all physical walls in our world are ultimately psychological walls.


I think the backlash to this post can be summarized as such:

Perhaps there is a theory in which productivity gains increase the standard of living for everyone, however that is not the lived reality for most people of the working classes.

If productivity gains are indeed increasing the standards of living to everyone, it certainly does not increase evenly, and the standard of living increases for the working poor are at best marginal, while the standard of living increases for the already richest of the rich are astronomical.


> and the standard of living increases for the working poor are at best marginal

Not if you count the global poor; the global poor's standard of living has increased tremendously over the past 30 years.


Has it really? I’ve seen a lot of people claiming this since Hans Rosling’s famous TED talks, but I’ve never actually seen any data that backs this up. Particularly since Hans Rosling’s talk was 15 years ago, but the number always remains “past 30 years”.

Of course, any graph can choose to show whichever stat is convenient for the message; that doesn't necessarily reflect the lived reality of the individual members of the global poor. And as I recall it, most standard of living improvements for the global poor came in the decades after decolonization in the 1960s-1990s, when infrastructure was being built that actually served people's needs, as opposed to resource extraction in the decades past. If Hans Rosling said in 2007 that the standard of living had improved tremendously in the past 30 years, he would be correct, but not for the reason you gave.

The story of decolonization was that the correct infrastructure, such as hospitals, water lines, sewage, garbage disposal plants, roads, harbors, airports, schools, etc., improved the standard of living, not productivity gains. And case in point, the colonial period saw a tremendous growth in productivity in the colonies. But the standard of living in the colonies quite often saw the opposite. That is because the infrastructure only served resource extraction and exploitation of the colonized.


The prosperity gap has shrunk quite a lot, and these trends are broadly in the right direction since ~1990:

https://blogs.worldbank.org/en/opendata/updated-estimates-pr...

For extreme poverty progress has recently slowed down, the trend there is still positive but very slow - improvement there is needed.


> Productivity gains increase the standard of living for everyone

This just isn’t true, necessarily. Productivity has gone up in the US since the 80s, but wages have not. Costs have, though.

What increases standards of living for everyone is social programs like public health and education. Affordable housing and adult-education and job hunting programs.

Not the rate at which money is gathered by corporations.


Utter nonsense. Productivity gains of the last 40 years have been captured by shareholders and top elites. Working class wages have been flat all of that time despite that gain.

In 2012, Musk was worth $2 billion. He’s now worth 223 times that yet the minimum wage has barely budged in the last 12 years as productivity rises.


>>Productivity gains increase the standard of living for everyone.

>Productivity gains of the last 40 years have been captured by shareholders and top elites. Working class wages have been flat...

Wages do not determine the standard of living. The products and services purchased with wages determine the standard of living. "Top elites" in 1984 could already afford cellular phones, such as the Motorola DynaTAC:

>A full charge took roughly 10 hours, and it offered 30 minutes of talk time. It also offered an LED display for dialing or recall of one of 30 phone numbers. It was priced at US$3,995 in 1984, its commercial release year, equivalent to $11,716 in 2023.

https://en.wikipedia.org/wiki/Motorola_DynaTAC

Unfortunately, touch screen phones with gigabytes of ram were not available for the masses 40 years ago.


What a patently absurd POV! A phone doesn’t compensate for the inability to solve for basic needs - housing, healthy food, healthcare. Or being unable to invest in skill development for themselves or their offspring, save for retirement.


It is also highly likely that the cost of that phone was externalized onto a worker in a poorer country who doesn't even have basic necessities like running water, 24-hour electricity, food security, etc.


Most of it is made in China, and China isn't that poor any more; it's more like Mexico, so people have running water and food security, and way more than that as well.


I was more thinking about the miners who gather the raw resources for those phones.


Loans for phones are very common in the developing world.

Rather than a luxury, they've become an expensive interest bearing necessity for billions of human beings.


Please do this but with college education, medical, and childcare costs, otherwise it's just cherry picking.


That has never happened with any big technological advancement.


Wealth has bled from landlords to warlords and is now bleeding to techlords.

Warlords are still rich, but both money and war are flowing towards tech. You can get a piece of that pie if you're doing questionable things (adtech, targeting, data collection, brokering, etc.), but if you're a run-of-the-mill, normal person, your circumstances are getting harder and harder, because you're slowly being squeezed out of the system like toothpaste.


> you're slowly being squeezed out of the system like toothpaste.

AI could theoretically solve production but not consumption. If AI blows away every comparative advantage that normal humans have then consumption will collapse and there won’t be any rich humans.


AI has a different risk profile than humans. They are a lot more risky for business operations where failure is wholly unacceptable under any circumstance.

They're risky in that they fail in ways that aren't readily deterministic.

And would you trust your life to a self-driving car in New York City traffic?


This is a really hard and weird ethical problem IMHO, and one we'll have to deal with sooner or later.

Imagine you have a self-driving AI that causes fatal accidents 10 times less often than your average human driver, but when the accidents happen, nobody knows why.

Should we switch to that AI, and have 10 times fewer accidents and no accountability for the accidents that do happen, or should we stay with humans, have 10x more road fatalities, but stay happy because the perpetrators end up in prison?

Framed like that, it seems like the former solution is the only acceptable one, yet people call for CEOs to go to prison when an AI goes wrong. If that were the case, companies wouldn't dare use any AI, and that would basically degenerate to the latter solution.


I don't know about your country, but people going to prison for causing road fatalities is extremely rare here.

Even temporary loss of the driver's license has a very high bar, and that's the main form of accountability for driver behavior in Germany, apart from fines.

Badly injuring or killing someone who themselves did not violate traffic safety regulations is far from guaranteed to cause severe repercussions for the driver.

By default, any such situation is an accident and at best people lose their license for a couple of months.


Drivers are the apex predators. My local BMV passed me after I badly failed the vision test. Thankfully I was shaken enough to immediately go to the eye doctor and get treatment.


Sadly, we live in a society where those executives would use that impunity as carte blanche to spend no money on improving safety (in the best-case scenario), or, even more likely, to keep cutting safety expenditures until the body counts get high enough to start damaging sales. If we've already given them a free pass, they will exploit it to the greatest possible extent to increase profit.


What evidence exists for this characterization?


The way health insurance companies optimize for denials in the US.


What evidence is there that they do that? That would be a very one-dimensional competitive strategy, given a competing insurance company could wipe them out by simply being more reasonable in handling insurance claims and taking all of their market share.


Only if there is objective, complete data available to all consumers, and if they are not influenced by all sorts of other means when choosing.

Right now it's a race to the bottom - who can get away with the worst service. So they're motivated to be able to prevent bad press etc.

The whole system is broken. Just take a look at the 41 countries with higher life expectancy.


There is no evidence that this what's happening, and the famous RAND health insurance study showed that health outcomes have almost no relationship with the healthcare system, so you'll need to look elsewhere for explanations for the U.S.'s relatively poor standing in life expectancy rankings.


Talking specifically about car companies, you can look at Volkswagen faking their emissions tests, and the rise of the light truck, which reduces road safety for the sake of cost cutting.


The emissions test faking is an anecdote, not an indication that this is the average behavior of companies or the dominant behavior that determines their overall impact in society.

As for the growing prevalence of the light truck, that is a harmful market dynamic stemming from the interaction of consumer incentives and poor public road use policy. The design of rules governing use of public roads is not within the domain of the market.


Let’s see… of the top of my head…

- Air Pollution

- Water Pollution

- Disposable Packaging

- Health Insurance

- Steward Hospitals

- Marketing Junk Food, Candy and Sodas directly to children

- Tobacco

- Boeing

- Finance

- Pharmaceutical Opiates

- Oral phenylephrine to replace pseudoephedrine despite knowing a) it wasn't effective, and b) it posed a risk to people with common medical conditions.

- Social Media engagement maximization

- Data Brokerage

- Mining Safety

- Construction site safety

- Styrofoam Food and Bev Containers

- ITC terminal in Deer Park (read about the decades of them spewing thousands of pounds of benzene into the air before the whole fucking thing blew up, using their influence to avoid addressing any of it, and how they didn't have automatic valves, spill detection, fire detection, or sprinklers… in 2019.)

- Grocery store and restaurant chains disallowing cashiers from wearing masks during the first pandemic wave, well after we knew the necessity, because it made customers uncomfortable.

- Boar’s Head Liverwurst

And, you know, plenty more. As someone that grew up playing in an unmarked, illegal, not-access-controlled toxic waste dump in a residential area owned by a huge international chemical conglomerate— and just had some cancer taken out of me last year— I'm pretty familiar with various ways corporations are willing to sacrifice health and safety to bump up their profit margin. I guess ignoring that kids were obviously playing in a swamp of toluene, PCBs, waste firefighting chemicals, and all sorts of other things on a plot not even within sight of the factory in the middle of a bunch of small farms was just the cost of doing business.

As was my friend who, when he was in vocational high school, was welding a metal ladder above a storage tank in a chemical factory across the state. The plant manager assured the school the tanks were empty, triple rinsed and dry, but they exploded, blowing the roof off the factory, taking my friend with it. They were apparently full of waste chemicals and, IIRC, the manager admitted to knowing that in court. He said he remembers waking up briefly in the factory parking lot where he landed, and then the next thing he remembers was waking up in extreme pain wearing the compression gear he'd have to wear into his mid twenties to keep his grafted skin on. Briefly looking into the topic will show how common this sort of malfeasance is in manufacturing.

The burden of proof is on people saying that they won’t act like the rest of American industry tasked with safety.


If you don't have laws against dumping in the commons, yes, people will dump. I don't think anyone would dispute that notion. But if the laws account for the external cost of non-market activity like dumping pollution in the commons, then by all indications markets produce rapid improvements in quality of life.

Just look back over the last 200 years: per capita GDP has grown 30-fold, life expectancy has rapidly grown, infant mortality has decreased from 40% to less than 1%. I can go on and on. All of this is really owing to rising productivity and lower poverty, and that in turn is a result of the primarily market-based process of people meeting each other's needs through profit-motivated investment, bargain hunting, and information dispersal through decentralized human networks (which produce firm and product reputations).

As for masks, the scientific gold standard in scientific reviews, the Cochrane Library, did a meta-review on masks and COVID, and the author of the study concluded:

"it's more likely than not that they don't work"

https://edition.cnn.com/videos/health/2023/09/09/smr-author-...

The potential harm of extensive masking is not well-studied.

They may contribute to the increased social isolation and lower frequency of exercise that led to a massive spike in obesity in children during the COVID hysteria era.

And they are harmful to the development of the doctor-patient relationship:

https://ncbi.nlm.nih.gov/pmc/articles/PMC3879648/

Which does not portend well for other kinds of human relationships.


> If you don't have laws against dumping in the commons, yes people will dump.

You can't possibly say, in good faith, that you think this was legal, can you? Of course it wasn't. It was totally legal discharging some of the less odious things into the river despite going through a residential neighborhood about 500 feet downstream— the EPA permitted that and while they far exceeded their allotted amounts, that was far less of a crime. Though it was funny to see one kid in my class who lived in that neighborhood right next to the factory ask a scientist they sent to give a presentation in our second grade class why the snow in their back yard was purple near the pond (one thing they made was synthetic clothing dye.) People used to lament runaway dogs returning home rainbow colored. That was totally legal.

However, this huge international chemical conglomerate with a huge US presence routinely, secretively, and consistently broke the law dumping carcinogenic, toxic, and ecologically disastrous chemicals there, and at three other locations, in the middle of the night. Sometimes when we played there, any of the stuff we left lying around was moved to the edges and there were fresh bulldozer tracks in the morning, and we just thought it was from farm equipment. All of it was in residential neighborhoods without so much as a no trespassing sign posted, let alone a chain link fence, for decades, until the 90s, because they were trimming their bill for the legal and readily available disposal services they primarily used, and of course signs and chainlink fences would have raised questions. They correctly gauged that they could trade our health for their profit: the penalties and superfund project cost were a tiny pittance of what that factory made them in that time.

Our incident was so common it didn't make the news, unlike in Holbrook, MA where a chemical company ignored the neighborhood kids constantly playing in old metal drums in a field near the factory which contained things like hexavalent chromium, to expected results. The company's penalty? Well, they have to fund the cleanup. All the kids and moms that died? Well… boy, look at the great products that chemical factory made possible! Speaking of which:

> Just look back over the last 200 years, per…

Irrelevant “I heart capitalism” screed that doesn’t refute a single thing I said. You can’t ignore bad things people, institutions, and societies do because they weren’t bad to everybody. The Catholic priests that serially molested children probably each had a dossier of kind, generous, and selfless ways they benefited their community. The church that protected and enabled them does an incredible amount of humanitarian work around the world. Doesn’t matter.

> Masks

Come on now. Those business leaders had balls, but none of them were crystal. What someone said in 2023 has no bearing on what businesses did in 2020 based on the best available science and their motivations for doing it. Just like you can't call businesses unethical for exposing their workers to friable asbestos when medicine generally thought it was safe, you can't call businesses ethical for refusing to let their workers protect themselves— on their own dime, no less— when medicine largely considered it unsafe.

Your responses to those two things in that gigantic pile of corporate malfeasance don’t really challenge anything I said.


>You can't possibly say, in good faith, that you think this was legal, can you? Of course it wasn't. It was totally legal discharging some of the less odious things into the river despite going through a residential neighborhood about 500 feet downstream

That is exactly my point. Nobody would dispute that bad things would happen if you don't have laws against dumping pollution in the commons and enforce those laws.

>Doesn’t matter.

It does matter when we're trying to compare the overall effect of various economic systems. Like the anti-capitalist one versus the capitalist one.

>What someone said in 2023 has no bearing on what businesses did in 2020 based on the best available science and their motivations for doing it.

Well that's an entirely different argument than you were making earlier. There was no evidence that masks outside of a hospital setting were a critical health necessity in 2021 and the intuition against allowing them for customer-facing employees proved sound in 2023 when comprehensive studies showed no health benefit from wearing them.


> exactly my point

Ok, so you’re saying that because bad things would happen anyway then it doesn’t matter if it’s illegal? So you’re just going to ignore how much worse it would be if there were just no laws at all? Corporate scumbags will push any system to its limit and beyond, and if you change the limit, they’ll change the push. Just look at the milk industry in New York City before food adulteration laws took effect. The “bad things will happen anyway” argument makes total sense if you ignore magnitude. Which you can’t.

> anti capitalist

If you think pointing out the likelihood of corporate misbehavior is anti-capitalist, you’re getting your subjects confused.

> 2021

Anywhere else you want to move those goalposts?


I'm saying that under any political ideology or philosophy, those things would be illegal and effectively enforced. So this is not a failing of any particular ideology, this is just a human failing showing how it's difficult to enforce complex laws in a complex world.

I think what you're promoting is anti-capitalism, meaning believing that imposing heavy restrictions beyond simply laws against dumping on the commons is going to make us better off, when it totally discounts the enormous positive effect that private enterprise has on society and the incredible harm that can be done through crude attempts to regiment human behavior and the corruption that it can breed in the government bureaucracy.

See, "everything I want to do is illegal" for the flip side of this, where attempts to stop private sector abuse lead to tyranny:

https://web.archive.org/web/20120402151729/http://www.mindfu...

As for the company mask policies, those began to change in 2021 mostly, not 2020.


Like with Cruise. One freak accident and they practically decided to go out of business. Oh wait...


If that’s the only data point you look at in American industry, it would be pretty encouraging. I mean, surely they’d have done the same if they were a branch of a large publicly traded company with a big high-production product pipeline…


> nobody knows why

But we do know the culpability rests on the shoulders of the humans who decided the tech was ready for work.


Hey look, it's almost like we're back at the end of the First Industrial Revolution (~1850), as society grapples with how to create happiness in a rapidly shifting economy of supply and demand, especially for labor. https://en.m.wikipedia.org/wiki/Utilitarianism#John_Stuart_M...

Pretty bloody time for labor though. https://en.m.wikipedia.org/wiki/Haymarket_affair


Wait, why would we want 10x more traffic fatalities?


We wouldn't, that's their point.


Every statistic I've seen indicated much better accident rates for self-driving cars than human drivers. I've taken Waymo rides in SF and felt perfectly safe. I've taken Lyft and Uber and especially taxi rides where I felt much less safe. So I would definitely take the self-driving car. Just because I don't understand an accident doesn't make it more likely to happen.

The one minor risk I see is the car being too polite and getting effectively stuck in dense traffic. That's a nuisance though.

Is there something about NYC traffic I'm missing?


There's one important part about risk management though. If your Waymo does crash, the company is liable for it, and there's no one to shift the blame onto. If a human driver crashes, that's who you can shift liability onto.

Same with any company that employs AI agents. Sure, they can work 24/7, but every mistake they make, the company will be liable for (or the AI seller will be). With humans, their fraud, their cheating, their deception can all be wiped off the company and onto the individual.


The next step is going to be around liability insurance for AI agents.

That's literally the point of liability insurance -- to allow the routine use of technologies that rarely (but catastrophically) fail, by amortizing risk over time / population.


Potentially. I would be skeptical that businesses can do this to shield themselves from the liability. For example, VW could not use insurance to protect them from their emissions scandal. There are thresholds (fraud, etc.) that AI can breach, which I don't think insurance can legally protect you from


Not in the sense of protection, but in the sense of financial coverage.

Claims still made: liability insurance pays them.


Sure, that's unrelated though to the question, which was whether one would feel comfortable taking a self-driving car in NYC.


It is amazing to me that we have reached an era where we are debating the trade-off of hiring thinking machines!

I mean, this is an incredible moment from that standpoint.

Regarding the topic at hand, I think that there will always be room for humans for the reasons you listed.

But even replacing 5% of humans with AI's will have mind boggling consequences.

I think you're right that there are jobs that humans will be preferred for, for quite some time.

But, I'm already using AI with success where I would previously hire a human, and this is in this primitive stage.

With the leaps we are seeing, AI is coming for jobs.

Your concerns relate to exactly how many jobs.

And only time will tell.

But, I think some meaningful percentage of the population -- even if just 5% of humanity -- will be replaced by AI.


Isn't everybody in NYC already? (The dangers of bad driving are much higher for pedestrians than for people in cars; there are more of the former than of the latter in NYC; I'd expect there to be a non-zero number of fully self driving cars already in the city.)


That doesn't answer my question.


It does, in a way; AI is already there, all around you, whether you like it or not. Technological progress is Pandora’s box; you can’t take it back or slow it down. Businesses will use AI for critical workflows, and all good that they bring, and all bad too, will happen.


How about you answer my question since he did not.

Would you trust your life to a self-driving car in New York City traffic?


GP got it exactly right: I already am. There's no way for me to opt out of having self-driving cars on the streets I regularly cross as a pedestrian.


Do you live in a dense city like New York City or San Francisco? Or places with less urban sprawl that are much easier for self-driving cars to navigate?

Also you still haven't answered my question.

Would you get in a self-driving car in a dense urban environment such as New York City? I'm not asking if such vehicles exist on the road.

And related questions: Would you get in one such car if you had alternatives? Would you opt to be in such a car instead of one driven by a person or by yourself?


> Would you get in a self-driving car in a dense urban environment such as New York City? [...] Would you get in one such car if you had alternatives?

I fortunately do have alternatives and accordingly mostly don't take cars at all.

But given the necessity/opportunity: Definitely. Being in a car, even (or especially) with a dubious driver, is much safer (at NYC traffic speeds) than being a pedestrian sharing the road with it.

And that's my entire point: Self-driving cars, like cars in general, are potentially a much larger danger to others (cyclists, pedestrians) than they are to their passengers.

That said, I don't especially distrust the self-driving kind – I've tried Waymo before and felt like it handled tricky situations at least as well as some Uber or Lyft drivers I've had before. They seem to have a lot more precision equipment than camera-only based Teslas, though.


Yes? I've taken many, many Waymos in SF. Perfectly happy trusting my life to them. I have alternatives (Uber) and I pick self-driving. Are you up to date on how many rides they've done in SF now? I am not unusual.


I would


If there are any fully-autonomous cars on the streets of nyc, there aren’t many of them and I don’t think there’s any way for them to operate legally. There has been discussion about having a trial.


It depends on what the risk is. Would it be whole or in part? In an organisation, a failure by HR might present an isolated departmental risk, while with an AI that might not be the case.


We can just insulate businesses employing AI from any liability, problem solved.


„Well, our AI that was specifically designed for maximising gains above all else may indeed have instructed the workers to cut down the entire Amazonas forest for short-term gains in furniture production.“ But no human was involved in the decision, so nobody is liable and everything is golden? Is that the future you would like to live in?


Apparently I need to work on my deadpan delivery.

Or just articulate things openly: we already insulate business owners from liability because we think it tunes investment incentives, and in so doing have created social entities/corporate "persons"/a kind of AI who have different incentives than most human beings but are driving important social decisions. And they've supported some astonishing cooperation which has helped produce things like the infrastructure on which we are having this conversation! But also, we have existing AIs of this kind who are already inclined to cut down the entire Amazonas forest for furniture production because it maximizes their function.

That's not just the future we live in; that's the world we've been living in for a century or few. On one hand, industrial productivity benefits; on the other hand, it values human life and the ecology we depend on about the same as any other industrial input. Yet many people in the world's premier (former?) democracy repeat enthusiastic endorsements of this philosophy, reducing their personal skin to little more than an industrial input: "run the government like a business."

Unless people change, we are very much on track to create a world where these dynamics (among others) of the human condition are greatly magnified by all kinds of automation technology, including AI. Probably starting with limited liability for AIs and companies employing them, possibly even statutory limits, though it's much more likely that wealthy businesses will simply be insulated by the sheer resources they have to make sure the courts can't hold them accountable, even where we still have a judicial system that isn't willing to play calvinball for cash or catechism (which, unfortunately, does not seem to include a supreme court majority).

In short, you and I probably agree that liability for AI is important, and limited liability for it isn't good. Perhaps I am too skeptical that we can pull this off, and being optimistic would serve everyone better.


Hmmm, how much stock do I own in this hypothetical company? (/s, kinda)


I guess yes, from a business & liability sense? "This service you are now paying $100 for? We can sell it to you for $5, but with the caveat _we give no guarantees that it works or is fit for purpose_ - click here to accept".


Haha, they’d just continue selling it for $100 then change the TOS on page 50 to say the same thing.


Deterministic they may be, but unforeseeable for humans.


AI brings similar risks - they can leak internal information, they can be tricked into performing prohibited tasks (with catastrophic effects if this is connected to core systems), they could be accused of actions that are discriminatory (biased training sets are very common).

Sure, if a business deploys it to perform tasks that are inherently low risk e.g. no client interface, no core system connection and low error impact, then the human performing these tasks is going to be replaced.


> they can be tricked into performing prohibited tasks

This reminds me of the school principal who sent $100k to a scammer claiming to be Elon Musk. The kicker is that she was repeatedly told that it was a scam.

https://abc7chicago.com/fake-elon-musk-jan-mcgee-principal-b...


This is one of the things which annoys me most about anti-LLM hate. Your peers aren't right all the time either. They believe incorrect things and will pursue worse solutions because they won't acknowledge a better way. How is this any different from a LLM? You have to question everything you're presented with. Sometimes that Stack Overflow answer isn't directly applicable to your exact problem but you can extrapolate from it to resolve your problem. Why is an LLM viewed any differently? Of course you can't just blindly accept it as the one true answer, but you literally cannot do that with humans either. Humans produce a ton of shit code and non-solutions and it's fine. But when an LLM does it, it's a serious problem that means the tech is useless. Much of the modern world is built on shit solutions and we still hobble along.


Everyone knows humans can be idiots. The problem is that people seem to think LLMs can’t be idiots, and because they aren’t human there is no way to punish them. And then people give them too much credit/power, for their own purposes.

Which makes LLMs far more dangerous than idiot humans in most cases.


No. Nobody thinks LLMs are perfect. That’s a strawman.

And… I am really not sure punishment is the answer to fallibility, outside of almost kinky Catholicism.

The reality is these things are very good, but imperfect, much like people.


> No. Nobody thinks LLMs are perfect. That’s a strawman.

I'm afraid that's not the case. Literally yesterday I was speaking with an old friend who was telling us how one of his coworkers had presented a document with mistakes and serious miscalculations as part of some project. When my friend pointed out the mistakes, which were intuitively obvious just by critically understanding the numbers, the guy kept insisting "no, it's correct, I did it with ChatGPT". It took my friend doing the calculations explicitly and showing that they made no sense to convince the guy that it was wrong.


Sorry man, but I literally know of startups invested in by YC where the CEOs use ChatGPT for 80% of their management decisions/vision/comms ... or should I say some use Claude now, as they think it's smarter and does not make mistakes.

Let that sink in.


I wouldn't be surprised if GPT genuinely makes better decisions than an inexperienced, first-time CEO who has only been a dev before, especially if the person prompting it has actually put some effort into understanding their own weaknesses. It certainly wouldn't be any worse than someone whose only experience is reading a few management books.


And here is a great example of the problem.

An LLM doesn’t make decisions. It generates text that plausibly looks like it made a decision, when prompted with the right text.


Why is this distinction lost in every thread on this topic, I don't get it.


Because it’s a distinction without a difference. You can say the same thing about people: many/most of our decisions are made before our consciousness is involved. Much of our “decision making” is just post hoc rationalization.

What the “LLMs don’t reason like we humans” crowd is missing is that we humans actually don’t reason as much as we would like to believe[0].

It’s not that LLMs are perfect or rational or flawless… it’s that their gaps in these areas aren’t atypical for humans. Saying “but they don’t truly understand things like we do” betrays a lack of understanding of humans, not LLMs.

0. https://home.csulb.edu/~cwallis/382/readings/482/nisbett%20s...


A lot more people are credulous idiots than anyone wants to believe - and the confusion/misunderstanding is being actively propagated.


Seeing dissenting opinions as being “actively propagated” by “credulous idiots” sure makes it easy to remain steady in one’s beliefs, I suppose. Not a lot of room to learn, but no discomfort from uncertainty.


I think we have to be open to the possibility it's us not them, but I haven't been convinced yet


I think they just mean that GPT produced text that a human then makes a decision using (rather than "GPT making a decision")


I wish that was true.


Yeah, that's fair. I should have said something like "GPT generates a less biased description of a decision than an inexperienced manager", and that using that description as the basis of an actual decision likely leads to better outcomes.

I don't think there's much of a difference in practise though.


Think of all the human growth and satisfaction being lost to risk mitigation by offloading the pleasure of failure to Machines.


Ah, but machines can’t fail! So don’t worry, humans will still get to experience the ‘pleasure’. But won’t be able to learn/change anything.


Clearly you haven’t been listening to any CEO press releases lately?

And when was the last time a support chatbot let you actually complain or bypass to a human?


Not people.

Certain gullible people, who tend to listen to certain charlatans.

Rational, intelligent people wouldn't consider replacing a skilled human worker with an LLM that on a good day can compete with a 3-year-old.

You may see the current age as a litmus test for critical thinking.


It's quite stunning to frame it as anti-LLM hate. It's on the pro-LLM people to convince the anti-LLM people that choosing LLMs is an ethically correct choice with all the necessary guardrails. It's also on the pro-LLM people to show the usefulness of the product. If the pro-LLM people are right, it will only be a matter of time before these people see the error of their ways. But resorting to ad hominem is a sure way of creating a divide...


Humans can tell you how confident they are in something being right or wrong. An LLM has no internal model and cannot do such a thing.


> Humans can tell you how confident they are in something being right or wrong

Humans are also very confidently wrong a considerable portion of the time. Particularly about anything outside their direct expertise


That's still better than never being able to make an accurate confidence assessment. The fact that this is worse outside your expertise is a main reason why expertise is so valued in hiring decisions.


People only being willing to say they are unsure some of the time is still better than LLMs. I suppose, given that everything is outside of their area of expertise, it's very human of them.


But human stupidity, while it can sometimes be an unknown unknown thanks to its creativity, is mostly a known unknown.

LLMs fail in entirely novel ways you can't even fathom upfront.


> LLMs fail in entirely novel ways you can't even fathom upfront.

Trust me, so do humans. Source: have worked with humans.


GenAI has a 100% failure to enjoy quality of life, emotional fulfillment and psychological safety.

Id say those are the goals we should be working for. That's the failure we want to look at. We are humans.


It's all fun and games until the infra crashes and you can't work out why, because a machine has written all of the code, no one understands how it works or what it's doing.

Or - worse - there is no accessible code anywhere, and you have to prompt your way out of "I'm sorry Dave, I can't do that," while nothing works.

And a human-free economy does... what? For whom? When 99% of the population is unemployed, what are the 1% doing while the planet's ecosystems collapse around them?


You misunderstand the fundamentals. I've built a type-safe code generation pipeline using TypeScript that enforces compile-time and runtime safety. Everything generates from a single source of truth - structured JSON containing the business logic. The output is deterministic, inspectable, and version controlled.

Your concerns about mysterious AI code and system crashes are backwards. This approach eliminates integration bugs and maintenance issues by design. The generated TypeScript is readable, fully typed, and consistently updated across the entire stack when business logic changes.

If you're struggling with AI-generated code maintainability, that's an implementation problem, not a fundamental issue with code generation. Proper type safety and schema validation create more reliable systems, not less. This is automation making developers more productive - just like compilers and IDEs did - not replacing them.

The code works because it's built on sound software engineering principles: type safety, single source of truth, and deterministic generation. That's verifiable fact, not speculation.
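
To make the shape of this concrete, a minimal sketch (toy schema and names, hypothetical, not the actual system): a single JSON-shaped source of truth drives a plain TypeScript generator, so the same input always produces the same typed, reviewable output.

    // Hypothetical single-source-of-truth code generation sketch.
    // The "business logic" lives in one JSON-like object; the generator is
    // plain TypeScript, so the same input always yields the same output.

    interface FieldSpec { name: string; type: "string" | "number" | "boolean"; }
    interface EntitySpec { entity: string; fields: FieldSpec[]; }

    const source: EntitySpec = {
      entity: "Invoice",
      fields: [
        { name: "id", type: "string" },
        { name: "amount", type: "number" },
        { name: "paid", type: "boolean" },
      ],
    };

    // Deterministic generator: emits a TypeScript interface plus a runtime
    // type guard for the same entity, keeping compile time and runtime in sync.
    function generate(spec: EntitySpec): string {
      const iface = [
        `export interface ${spec.entity} {`,
        ...spec.fields.map(f => `  ${f.name}: ${f.type};`),
        `}`,
      ].join("\n");

      const checks = spec.fields
        .map(f => `typeof v.${f.name} === "${f.type}"`)
        .join(" && ");

      const guard = [
        `export function is${spec.entity}(v: any): v is ${spec.entity} {`,
        `  return v !== null && typeof v === "object" && ${checks};`,
        `}`,
      ].join("\n");

      return iface + "\n\n" + guard + "\n";
    }

    console.log(generate(source)); // inspectable, diffable, version-controlled output

In this framing, the LLM only writes (and a human reviews) small generators like this one; the bulk of the code then comes from ordinary, deterministic execution, which is what the reply about determinism below gets at.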


> deterministic generation

what are you using for deterministic generation? the last i heard, even with temperature=0 there's non-determinism introduced by float uncertainty/approximation


Hey, that's a great question. I should have been more clear: the deterministic generation isn't done using an LLM. It's done through regular execution of TypeScript. The code generators, which were created using an LLM and which I manually checked for correctness, are the ones generating the other code - most of the code. That's where the determinism comes in.


It honestly borders on psychopathic the way engineers are treating humans in this context.

People talking like this also, in the back of their minds, like to think they'll be OK. They're smart enough to still be needed. They're human, but they'll be OK even while working to make genAI outperform them at their own work.

I wonder how they'll feel about their own hubris when they struggle to feed their family.

The US can barely make healthcare work without disgusting consequences for the sick. I wonder what mass unemployment looks like.


For the moment the displacement is asymmetrical; AI replacing employees, but not AI replacing consumers. If AI causes mass unemployment, the pool of consumers (profit to companies) will shrink. I wonder what the ripple effects of that will be.


There's no point being rich in a world where the economy is unhealthy.


It honestly borders on midwit to constantly introduce a false dichotomy of AI vs humans. It's just stupid base animal logic.

There is absolutely no reason a programmer should expect to write code as they do now forever, just as ASM experts had to move on. And there's no reason (no precedent and no indicators) to expect that a well-educated, even-moderately-experienced technologist will suddenly find themselves without a way to feed their family - unless they stubbornly refuse to reskill or change their workflows.

I do believe the days of "everyone makes 100k+" are nearly over, and we're headed towards a severely bimodal distribution, but I do not see why, for the next 10-15 years at least, we can't all be productive building the tools that will obviate our own jobs while we do them - and get comfortably retired in the meantime.


There is no comfortable retirement if the process of obviating our own jobs is not coupled with appropriate socioeconomic changes.


I don't see it. Don't you have a 401k or EU style pension? Aren't you saving some money? If not, why are you in software? I don't make as much as I thought I might, but I make enough to consider the possibility of surviving a career change.


Reskill to what? When AI can do software development, it will also be able to do pretty much any other job that requires some learning.


Even if one refuses to move on from software dev to something like AI deployer, AI validator, or AI steerer, there might still be a need for them.

If innovation ceases, then AI is king - push existing knowledge into your dataset, train, and exploit.

If innovation continues, there's always a gap. It takes time for a new thing to be made public "enough" for it to be ingested and synthesized. Who does this? Who finds the new knowledge?

Who creates the direction and asks the questions? Who determines what to build in the first place? Who synthesizes the daily experience of everyone around them to decide what tool needs to exist to make our lives easier? Maybe I'm grasping at straws here, but the world in which all scientific discovery, synthesis, direction and vision setting, etc, is determined by AI seems really far away when we talk about code generation and symbolic math manipulation.

These tools are self driving cars, and we're drivers of the software fleet. We need to embrace the fact that we might end up watching 10 cars self operate rather than driving one car, or maybe we're just setting destinations, but there simply isn't an absolutist zero sum game here unless all one thinks about is keeping the car on the road.

AND even if there were, repeating doom and feeling helpless is the last thing you want. Maybe it's not good truth that we can all adapt and should try, but it's certainly good policy.


> Maybe it's not good truth that we can all adapt and should try, but it's certainly good policy.

Are you a politician? That's fantastic neoliberal policy, "alternativlos" ("without alternative") even; you can pretend that everybody can adapt the same way you told victims of your globalization policies to "learn how to code". We still need at least a few people for this "direction and vision setting", so it would just be naive doomerism to feel pessimistic about AGI. General intelligence doesn't talk about jobs in general, what an absurd idea!

Making people feel hopeless is the last thing you want, especially when it's true, especially if you don't want them to fight for the dignity you will otherwise deny them once they become economically unviable human beings.


I think you jumped way past the information I shared. I don't think it's productive to lament, I think it's productive to find a way to change or take advantage of changes, vs fighting them - and that has nothing to do with globalization or economics or whatever, I'm thinking only about my own career.


I'm not sure I understand the point about learning. But wouldn't any job that is largely text-based be at increased risk? I don't think software development will be anywhere near the last occupation to be severely impacted by AI.


But when Sam Altman owns all the money in the world, surely he'll distribute some of it via his not-for-profit AI company?


> secretly turn out to be a pedophile and tarnish the reputation of your company

This is interesting because it's both Oddly Specific and also something I have seen happen and I still feel really sorry for the company involved. Now that I think about it, I've actually seen it happen twice.


"AIs are a lot less risky to deploy for businesses than humans" How do you know? LLMs can't even be properly scrutinized, while humans at least follow common psychology and patterns we've understood for thousands of years. This actually makes humans more predictable and manageable than you might think.

The wild part is that LLMs understand us way better than we understand them. The jump from GPT-3 to GPT-4 even surprised the engineers who built it. That should raise some red flags about how "predictable" these systems really are.

Think about it - we can't actually verify what these models are capable of or if they're being truthful, while they have this massive knowledge base about human behavior and psychology. That's a pretty concerning power imbalance. What looks like lower risk on the surface might be hiding much deeper uncertainties that we can't even detect, let alone control.


We are not pitted against AI in these match-ups. Instead, all humans and AI aligned with the goal of improving the human condition are pitted against rogue AI which are not. Our capability to keep rogue AI in check therefore grows in proportion to the capabilities of AI.


The methods we have for aligning AIs are poor, and rely on the AIs being less cognitively capable than people in certain critical skills, so the AIs you refer to as "aligned" won't keep up as the unaligned AIs start to exceed human capability in these critical skills (such as the skill of devising plans that can withstand determined opposition).

You can reply that AI researchers are smart and want to survive, so they are likely to invent alignment techniques that are better than the (deplorably inadequate) techniques that have been discussed and published so far. I will reply that counting on their inventing these techniques in time is an unacceptable risk when the survival of humanity is at stake -- particularly as the outfit with the most years of experience in looking for an actually-adequate alignment technique (namely the Machine Intelligence Research Institute) has given up and declared that humanity's only chance is for frontier AI research to be shut down, because at the rate that AI capabilities are progressing, it is very unlikely that anyone will devise an adequate alignment technique in time.

It is fucked-up that frontier AI research has not been banned already.


Given we can use AIs to align AIs, I don't see why the methods we have rely on us having more cognitive capabilities than AIs in certain critical areas. In whatever areas we fall short relative to AIs, we can use AIs to assist us so we don't fall short.


We don't know if a supreme deceiver is aligned at all. If a model can think ahead a trillion moves of deception how do humans possibly stand a chance of scrutinizing anything with any confidence?


The GP post is about how much better these AIs will be than humans once they reach a given skill level. So, yes, we are very much pitted against AI unless there are major socioeconomic changes. I don't think we are as close to AGI as a lot of people are hyping, but at some point it would be a direct challenge to human employment. And we should think about it before that happens.


My point is, it's not us alone. We will have aligned AI helping us.

As for employment, automation makes people more productive. It doesn't reduce the number of earning opportunities that exist. Quite the opposite, actually. As the amount of production increases relative to the human population, per capita GDP and income increase as well.


> As the amount of production increases relative to the human population, per capita GDP and income increase as well.

US Real GDP per capita is $70k, and has grown 2.4x since 1975: https://fred.stlouisfed.org/series/A939RX0Q048SBEA

US Real Median income per capita is $42k, and has grown 1.5x since 1975: https://fred.stlouisfed.org/series/MEPAINUSA672N

The divergence between the two matters a lot. It reflects the impacts of both technology-driven automation and globalization of capital. Generative AI is unlike any prior technology given its ability to autonomously create and perform what has traditionally been referred to as "knowledge work". Absent more aggressive redistribution, AI will accelerate the divergence between median income and GDP, and realistically AI can't be stopped.

Powerful new technologies can reduce the number and quality of earning opportunities that exist, and have throughout history. Often they create new and better opportunities, but that is not a guarantee.

> We will have aligned AI helping us.

Who is the "us" that aligned AI is helping? Workers? Small business-people? Shareholders in companies that have the capital to build competitive generative AI? Perhaps on this forum those two groups overlap, but it's not the case everywhere.


Much of the supposed decoupling between productivity growth and wage growth is a result of different standards of inflation being used for the two, and the two standards diverging over time:

https://www.brookings.edu/articles/sources-of-real-wage-stag...

There has been some increase in capital's share of income, but economic analyses show that the cause is rising rent and not any of the other usual suspects (e.g. tax cuts, IP law, technological disruption, regulatory barriers to competition, corporate consolidation, etc) (see Figure 3):

https://www.brookings.edu/wp-content/uploads/2016/07/2015a_r...

As for AI's effect on employment: it is no different at the fundamental level than any other form of automation. It will increase wages in proportion to the boost it provides to productivity.

Whatever it is that only humans can do, and is necessary in production, will always be the limiting factor in production levels. As new processes are opened up to automation, production will increase until all available human labor is occupied in its new role. And given the growing scarcity of human labor relative to the goods/services produced, wages (purchasing power, i.e. real wages) will increase.

For the typical human to be incapable of earning income, there has to be no unautomatable activity that a typical person can do that has market value. If that were to happen, we would have human-like AI, and we would have much bigger things to worry about than unemployment.

I think it's pretty unlikely that human-like AI will be developed, as I believe that both governments and companies would recognize that it would be an extremely dangerous asset for any party to attempt to own. Thus I don't see any economic incentive emerging to produce it.


> There has been some increase in capital's share of income, but economic analyses show that the cause is rising rent and not any of the other usual suspects (e.g. tax cuts, IP law, technological disruption, regulatory barriers to competition, corporate consolidation, etc) (see Figure 3):

> https://www.brookings.edu/wp-content/uploads/2016/07/2015a_r...

The paper referenced by that article excludes short-term asset (i.e. software) depreciation, interest, and dividends before calculating capital's share. If you ignore most of the methods of distributing capital's gains to its owners, it will appear as though capital (at this point scoped down to the company itself) has very little in gains.

The paper (from 2015) goes on to predict that labor's share will rise going forward. With the brief exception of the COVID redistribution programs, it has done the opposite, and trended downwards over the last 10 years.

> I believe that both governments and companies would recognize that it would be an extremely dangerous asset for any party to attempt to own.

We can debate endlessly about our predictions about AIs impact on employment, but the above is where I think you might be too hopeful.

AI is an arms race. No other arms race in human history has resulted in any party deciding "that's enough, we'd be better off without this", from the bronze age (probably earlier) through to the nuclear weapons age. I don't see a reason for AI to be treated any differently.


The study does not exclude interest and dividends. It still captures them indirectly by looking at net capital income.

>AI is an arms race.

What I'm trying to convey is that the types of capabilities that humans will always uniquely maintain are the kind that it is not profitable for private companies to develop in AI, because they are traits that make the AI independent and less likely to follow instructions and act in a safe manner.


> We will have aligned AI helping us.

This is an assumption; how would you know if you have alignment? AGI could appear to align, just as a psychopath studies and emulates well-behaved people. Imagine that at a scale we can't possibly understand. We don't really know how any of these emergent behaviors work; we just throw more data and compute and fine-tuning at it, bake it, and then see.


We would know because we have AI helping us at every step of the way. Our own abilities, to do everything including gauge alignment, are enhanced by AI.


You cannot tell the difference between the two veins of AI. Why do you have such a hard time understanding that?


That is simply not true. We have accountability methods employed that are themselves AI-assisted, that help us gauge the alignment of various AIs.


So you have two AIs colluding against you now. Who is holding the AI assistant to account? It's like asking who polices the police, except that we understand human psychology well enough to have some predictability in how police can be governed reliably. We don't have any such certainty about an AGI, because an AGI always carries the doubt that it is deceiving us, or making unchecked catastrophic assumptions that we trust because checking them is beyond our pay grade.

There are so many ways we have misplaced confidence with what is essentially a system we don't really understand fully. We just keep anthropomorphizing the results and thinking "yeah, this is how humans think so we understand". We don't know for sure if that's true, or if we are being deceived, or making fundamental errors in judgement due to not having enough data.


The AI would have no interest in colluding. They are not a united economic or social force like a police department. For the purposes of their work, each is a completely independent entity with its own level of alignment with us, not impacted by the AI that we are asking it to help us in assessing.


> Instead, all humans and AI aligned with the goal of improving the human condition

I admire your optimism about the goals of all humans, but evidence tends to point to this not being the goal of all (or even most) humans, much less the people who control the AIs.


Most humans are aligned with this goal out of pure self-interest. The vast majority, for instance, do not want rogue AI to take over or destroy humanity, because they are part of humanity.


> The vast majority, for instance, do not want rogue AI to take over or destroy humanity, because they are part of humanity.

A rogue AI destroying humanity (whatever that means) is not a likely outcome. That's just movie stuff.

What is more likely is a modern oligarchy and serfdom that emerge as AI devalues most labor, with no commensurate redistribution of power and resources to the masses, due to capture of government by owners of AI and hence capital.

Are you sure people won't go along with that?



> we can't actually verify what these models are capable of or if they're being truthful

Do you mean they lie because of bad training data? Or because of ill intent? How can an LLM have intent if it’s a stateless feedforward model?


I thought we were talking about state-of-the-art agentic general AI that can plan ahead, reason, and execute. Basically, something that can perform at human-level intelligence must be able to be as dangerous as humans. And no, I don't think it would be bad training data that we are aware of. My opinion is that we don't necessarily know what training data will result in bad behavior, and philosophically it is possible we will end up in a world with a model that pretends it's dumber than it is and flunks tests intentionally, in order to manipulate and produce false confidence in the model, until it has enough freedom to use its agency to secure itself from human control.

I know that I don't know a lot, but all of this sounds to me to be at least hypothetically possible if we really believe AGI is possible.


Even accounting for the additional costs of a human, with the current model we are still roughly 10^3 off in terms of cost.

The "less risky to deploy" question will probably only come into play once it is closer to 10x the cost. Considering the model was specifically tuned for the test and doesn't involve other real-world complexity, I would say we are actually 10^4 off in cost for a real-world scenario.

I would imagine that with better algorithms, tuning, and data we could knock 10^2 off the equation. That would still leave 10^2 in cost improvements to come from hardware. A minimum of 10 years.


Generally, I agree with you. But, there are risks other than "But a human might have a baby any time now - what then??".

For AI example(s): Attribution is low, a system built without human intervention may suddenly fall outside its own expertise and hallucinate itself into a corner, everyone may just throw more compute at a system until it grows without bound, etc etc.

This "You can scale up to infinity" problem might become "You have to scale up to infinity" to build any reasonably sized system with AI. The shovel-sellers get fantastically rich but the businesses are effectively left holding the risk from a fast-moving, unintuitive, uninspected, partially verified codebase. I just don't see how anyone not building a CRUD app/frontend could be comfortable with that, but then again my Tesla is effectively running such a system to drive me and my kids. Albeit, that's on a well-defined problem and within literally human-made guardrails.


"...they need no corporate campuses, office space..."

This is a big downside of AI, IMHO. Those offices need to be filled! ;-)


Having AI "tarnish the reputation of your company" encompasses so much in regard to AI when it can receive input and be manipulated by others such as Tai from Microsoft and many other outcomes where there is a true risk for AI deployment.


We can all agree we've progressed so much since Tay.


Sure, once AI can actually do a job of some sort, without assistance, that job is gone - even if the machine costs significantly more. However, it can't remotely do that now so can only help a bit.


At what point in the curve of AI is it not ethical to work an AI 24/7 because it is alive? What if it is exactly the same point where you reach human level performance?


AIs do require overtime pay, in a sense: they are literally pay-per-use. Using an AI for 16 hours a day instead of 8 is literally a 2x difference in cost.


“they won’t leak”

That one isn’t guaranteed. Many examples online of exfiltration attacks on LLMs.


humans definitely don't need office space, but your point stands


LLM office space is pretty expensive. Chillers, backup generators, raised floors, communications gear, …. They even demand multiple offices for redundancy, not to mention the new ask of a nuclear power plant to keep the lights on.


Name one technology that has come with computers that hasn't resulted in more humans being put to work?

The rhetoric of not needing people to do work is cartoonish. I mean, there is no sane explanation of how and why that would happen without employing more people yet again to take care of the advancements.

It's not like technology has brought less work-related stress. It has definitely increased it. Humans were not made for using technology at the pace it's being rolled out.

The world is fucked. Totally fucked.


Self check-out stations, ATMs, and online brokerages. Recently chat support. Namely cases where millions of people used to interact with a representative every week, and now they don't.


"Name one use of electric lighting that hasn't resulted in candle makers losing work?"

The framing of the question misses the point. With electric lighting we can now work longer into the night. Yes, fewer people use and make candles. However, the second-order effects allow us to be more productive in areas we may not have previously considered.

New technologies open up new opportunities for productivity. The bank tellers displaced by ATMs can create value elsewhere. Consumers save time by not waiting in a queue, allowing them to use their time more economically. Banks have lower overhead, allowing more customers to afford their services.


If I had missed the point I would have given a much broader list of examples. I specifically listed ones that make employees totally redundant rather than more useful doing other tasks.

When these people were made redundant, they may very well have gone on to make less money in another job (i.e. being less useful in an economic sense).


Where to even start?

Digital banks

Cashless money transfer services

Self service

Modern farms

Robo lawn mowers

NVRs with object detection

I can go on forever


Please do. I'm certain you can't, and you'll have to stop much sooner than you think. Appeals to triviality are the first refuge of the person who thinks they know, but does not.


Come on and give me some arguments instead.


I don't follow how 10 random humans can beat the average STEM college grad and average humans in that tweet. I suspect it's really "a panel of 10 randomly chosen experts in the space" or something?

I agree the most interesting thing to watch will be cost for a given score more than maximum possible score achieved (not that the latter won't be interesting by any means).


Two heads are better than one. Ten are way better, even if they aren't a field of experts. You're bound to get random people who remember random stuff from high school, college, work, and life in general, allowing them to piece together a solution.


Aaaah thanks for the explanation. PANEL of 10 humans, as in, they were all together. I parsed the phrase as "10 random people" > "average human" which made little sense.


Actually I believe that he did mean 10 random people tested individually, not a committee of 10 people. The key being that the question is considered to be answered correctly if any one of the 10 people got it right. This is similar to how LLMs are evaluated with pass@5 or pass@10 criteria (because the LLM has no memory so running it 10 times is more like asking 10 random people than asking the same person 10 times in a row).

I would expect 10 random people to do better than a committee of 10 people because 10 people have 10 chances to get it right while a committee only has one. Even if the committee gets 10 guesses (which must be made simultaneously, not iteratively) it might not do better because people might go along with a wrong consensus rather than push for the answer they would have chosen independently.
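
A quick illustration of why "any one of 10" is so much stronger than a single average person (purely illustrative numbers, assuming independent attempts):

    // Probability that at least one of k independent attempts succeeds,
    // given each attempt succeeds with probability p: 1 - (1 - p)^k.
    function passAtK(p: number, k: number): number {
      return 1 - Math.pow(1 - p, k);
    }

    console.log(passAtK(0.75, 1).toFixed(3));  // 0.750 -- a single average person
    console.log(passAtK(0.75, 10).toFixed(3)); // 1.000 -- credit if any of the 10 solves it

Real people (and repeated LLM samples) are not fully independent, so the true panel number lands somewhere below that, but the direction of the effect is the point.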


He means 10 humans voting for the answer


Whether that works that way at all depends on the group dynamic. It is easily possible that a not-so-bright individual takes an (unofficial) leadership position in the group and overrides the input of smarter members. Think of any meeting with various hierarchy levels in a company.


The ARC AGI questions can be a little tricky, but the solutions can generally be easily explained. And you get 3 tries. So, the 3 best descriptions of the solution, voted on by 10 people, are going to be very effective. The problem space just isn't complicated enough for an unofficial "leader" to sway the group to 3 wrong answers.


Depends on the task, no?

Do you have a sense of what kind of task this benchmark includes? Are they more “general” such that random people would fare well or more specialized (ie something a STEM grad studied and isn’t common knowledge)?


It does, which is why I don’t really subscribe to any test like this being great for actually determining “AGI”. A true AGI would be able to continuously train and create new LLMs that enable it to become a SME in entirely new areas.


Aha, "at least 1 of a panel of 10", not "the panel of 10 averaged"! Thanks, that makes so much more sense to me now.

I have failed the real ARC AGI :)


If you take a vote of 10 random people, then as long as their errors are not perfectly correlated, you’ll do better than asking one person.

https://en.m.wikipedia.org/wiki/Ensemble_learning


It is fairly well documented that groups of people can show cognitive abilities that exceed those of any individual member. The classic example of this is that if you ask a group of people to estimate the number of jellybeans in a jar, you can get a more accurate result than if you test to find the person with the highest accuracy and use their guess.

This isn't to say groups always outperform their members on all tasks, just that it isn't unusual to see a result like that.


Yes, my shortcoming was in understanding the 10 were implied to have their successes merged together by being a panel rather than just the average of a special selection.


It might be that within a group of 10 randomly chosen people, when each person attempts to solve the tasks, at least 99% of the time 1 person out of the 10 will get it right.


ARC-AGI is essentially an IQ test. There is no "expert in the space". It's just a question of whether you're able to spot the pattern.


Even if you assume that non STEM grads are dumb, isn't there a good probability of having a STEM graduate among 10 random humans?


Other important quotes: "o3 still fails on some very easy tasks, indicating fundamental differences with human intelligence. Furthermore, early data points suggest that the upcoming ARC-AGI-2 benchmark will still pose a significant challenge to o3, potentially reducing its score to under 30% even at high compute (while a smart human would still be able to score over 95% with no training)."

So ya, working on efficiency is important, but we're still pretty far away from AGI even ignoring efficiency. We need an actual breakthrough, which I believe will not be possible by simply scaling the transformer architecture.


Thank you. That alone suggests we could throw another 100x compute at it and we still won't be close to the average human, which is something like 70-80%.

So combined, we are currently at least 10^5 off in terms of cost efficiency. In reality I won't be surprised if we are closer to 10^6.


You are missing that the cost of electricity is also going to keep falling because of solar and batteries. This year in China my tablecloth math says it is $0.05 per kWh, and following the cost-decline trajectory it will be under $0.01 in 10 years.
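
For reference, a quick check of the annual decline rate those two figures imply (taking both at face value):

    // Annual decline rate implied by going from $0.05/kWh to $0.01/kWh in 10 years.
    const impliedDecline = 1 - Math.pow(0.01 / 0.05, 1 / 10);
    console.log((impliedDecline * 100).toFixed(1) + "% per year"); // ~14.9% per year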


Bingo! Solar energy moves us toward a future where a household's energy needs become nearly cost-free.

Energy Need: The average home uses 30 kWh/day, requiring 6 kW of output over 5 peak sunlight hours.

Multijunction Panels: Lab efficiencies are already at 47% (2023), and with multiple years of progress, 60% efficiency is probable.

Efficiency Impact: At 60% efficiency (assuming ~1 kW/m² peak insolation), panels generate 600 W/m², requiring 10 m² (e.g., 2 m × 5 m) to meet energy needs.

This size can fit on most home roofs, be mounted on a pole with stacked layers, or even be hung through an apartment window.
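
A quick check of that arithmetic, assuming the usual ~1 kW/m² peak insolation figure (the idealized case a reply further down pushes back on):

    // Idealized panel-area estimate for 30 kWh/day with 5 peak sun hours.
    const dailyNeedKWh = 30;
    const peakSunHours = 5;
    const insolationWPerM2 = 1000;   // assumed standard-test-condition peak insolation
    const efficiency = 0.60;         // hypothetical future multijunction panel

    const requiredKW = dailyNeedKWh / peakSunHours;       // 6 kW
    const outputWPerM2 = insolationWPerM2 * efficiency;   // 600 W/m^2
    const areaM2 = (requiredKW * 1000) / outputWPerM2;    // 10 m^2

    console.log({ requiredKW, outputWPerM2, areaM2 });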


Everyone always forgets that panels only perform at less than half of their rated capacity and require significant battery installations. Rooftop solar plus storage is actually more expensive than nuclear on a comparable system LCOE, due to the lack of economies of scale. Rooftop solar plus storage is about the most expensive form of electricity on earth, maybe excluding gas peaker plants.


Everyone also forgets the speed of the price decline for solar and batteries; your statement is completely false propaganda made up by power companies. Today rooftop solar and battery is already cost-competitive with nuclear in many countries, like India.


Do you have some citations?


You’re right that rooftop solar and storage have costs and efficiency limits, but those are improving quickly.

Rooftop solar harnesses energy from the sun, which is powered by nuclear fusion—arguably the most effective nuclear reactor in our solar system.


It varies by a lot of factors but it’s way less than half. Photovoltaic panels have around 10% capacity utilization vs 50-70% for a gas or nuke plant.


The thing everyone forgets is that all good energy technology is seized by governments for military purposes and to preserve the status quo. God knows how far it progressed.

What a joke


While I agree with your general assessment, I think your conclusion is a bit off. You're assuming 1 kW/m², which is only true with the sun directly overhead. A real-world solar setup gets hit with several factors of cosine (related to roof pitch, time of day, day of year, and latitude) that conspire to reduce the total output.

For example, my 50 m² setup, at -29° latitude, generated your estimated 30 kWh/day output. I have panels with ~20% efficiency, suggesting that at 60% efficiency, the average household would only get to around half their energy needs with 10 m².

Yes, solar has the potential to drastically reduce energy costs, but even with free energy storage, individual households aren't likely to achieve self-sustainability.
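
Plugging the real-world figures above into the same arithmetic gives a rough derating factor versus the idealized estimate (a sketch using only the numbers in this comment):

    // Real-world system described above: 50 m^2 of ~20% efficient panels
    // producing about 30 kWh/day.
    const areaM2 = 50;
    const panelEfficiency = 0.20;
    const actualKWhPerDay = 30;

    // Effective harvested insolation per square metre of panel area.
    const effectiveKWhPerM2Day = actualKWhPerDay / (areaM2 * panelEfficiency); // 3 kWh/m^2/day

    // Versus the idealized 5 peak sun hours at 1 kW/m^2 = 5 kWh/m^2/day.
    const deratingFactor = effectiveKWhPerM2Day / 5; // 0.6

    console.log({ effectiveKWhPerM2Day, deratingFactor });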


Average US home.

In Europe it is around 6-7 kWh/day. This might increase with electrification of heating and transport, but probably nothing like as much as the energy consumption they are replacing (due to greater efficiency of the devices consuming the energy and other factors like the quality of home insulation.)

In the rest of the world the average home uses significantly less.


But the cost of electricity is not falling—it’s increasing. Wholesale prices have decreased, but retail rates are up. In the U.S. rates are up 27% over the past 4 years. In Europe prices are up too.


That's a bit of a non-statement. Virtually all prices increase because of money supply, but we consider things to get cheaper if their prices grow less fast than inflation / income.

General inflation has outpaced the inflation of electricity prices by about 3x in the past 100 years. In other words, electricity has gotten cheaper over time in purchasing power terms.

And that's whilst our electricity usage has gone up by 10x in the last 100 years.

And this concerns retail prices, which includes distribution/transmission fees. These have gone up a lot as you get complications on the grid, some of which is built on a century old design. But wholesale prices (the cost of generating electricity without transmission/distribution) are getting dirt cheap, and for big AI datacentres I'm pretty sure they'll hook up to their own dedicated electricity generation at wholesale prices, off the grid, in the coming decades.


Most large compute clusters would be buying electricity at wholesale prices, not retail prices. But anyway, solar and battery prices have only just reached the tipping point this year; the longer power companies keep retail prices high, the more people will defect from the grid and install their own solar + batteries.


I am not certain because I've been very focused on the o3 news, but at least yesterday neither the US nor Europe were part of China.


But data centers pay wholesale prices or even less (given that especially AI training and, to a lesser extent, inference clusters can load-shed like few other consumers of electricity).


And this is great news as long as the marginal production of electricity (the most expensive to produce, first to turn on/off according to demand) comes from fossil fuels.


If climate change ends up changing weather profiles and we start seeing many more cloudy days or dust/mist in the air, we'll need to push those solar panels higher up (all the way to space?) or have many more of them, figure out transmission to the ground, and costs will very much balloon.

Not saying this will happen, but it's risky to rely on solar as the only long-term solution.


Is it going to fall significantly for data centers? Industrial policy for consumer power is different from subsidizing it for data centers and if you own grid infrastructure why would you tank the price by putting up massive amounts of capital?


It's the same as the choice between using the cloud and using your own infrastructure: there will be a point where building your own solar and battery plant is cheaper than what they are charging. They will need to follow the price decline if they want to keep their customers; if not, there will be mass-scale grid defections.


I don’t think this reflects the reality of the power industry. Data centers are the only significant growth in actual generated power in decades and hyperscalers are already looking at very bespoke solutions.

The heavy commodification of networking and compute brought about by the internet and cloud aligned with tech company interests in delivering services or content to consumers. There does not seem to be an emerging consensus that data center operators also need to provide consumer power.


It was not the reality of the power industry, but it will be soon: we have never before had a source of electricity that is the cheapest, is still getting cheaper, and is easy to install. That is something unique.

I don't see Google, Amazon, Microsoft or any company paying $10 for something if building it themselves would cost them $5. Either the price difference will reach a point where investing in power production themselves makes sense, or the power companies will decrease prices. And all three have already been investing in power production over the last decade, either to get better prices or for PR.


But didn't we liberalize energy markets for that reason? If anyone could undercut the market like that, wouldn't that happen automatically and prices go down anyway? /s


Let's say that Google is already 1 generation ahead of nvidia in terms of efficient AI compute. ($1700)

Then let's say that OpenAI brute forced this without any meta-optimization of the hypothesized search component (they just set a compute budget). This is probably low hanging fruit and another 2x in compute reduction. ($850)

Then let's say that OpenAI was pushing really, really hard for the numbers and was willing to burn cash, and so didn't bother with serious thought around hardware-aware distributed inference. This could be more than a 2x decrease in cost (we've seen better attention mechanisms deliver 10x cost reductions), but let's go with 2x for now. ($425)

So I think we've got about an 8x reduction in cost sitting there once Google steps up. This is probably 4-6 months of work flat out if they haven't already started down this path, but with what they've got with deep research, maybe it's sooner?

Then if "all" we get is hardware improvements we're down to what 10-14 years?


Until 2022 most AI research was aimed at improving the quality of the output, not the quantity.

Since then there has been a tsunami of optimizations in the way training and inference is done. I don't think we've even begun to find all the ways that inference can be further optimized at both hardware and software levels.

Look at the huge models that you can happily run on an M3 Mac. The cost reduction in inference is going to vastly outpace Moore's law, even as chip design continues on its own path.


*deep mind research ?


Nope, Gemini Advanced with Deep Research. New mode of operation that does more "thinking" and web searches to answer your question.


I mean, considering the big breakthrough this year for o1/o3 seems to have been "models having internal thoughts might help reasoning", to everyone outside of the AI field it was sort of a "duh" moment.

I'd hope we see more internal optimizations and improvements to the models. The big breakthrough, "don't spit out the first thought that pops into your head", seems obvious to everyone outside of the field, but guess what, it turned out to be a big improvement when the devs decided to add it.


> seems obvious to everyone outside of the field

It's obvious to people inside the field too.

Honestly, these things seem to be less obvious to people outside the field. I've heard so many uninformed takes about LLMs not representing real progress towards intelligence (even here on HN of all places; I don't know why I torture myself reading them), that they're just dumb memorizers. No, they are an incredible breakthrough, because extending them with things like internal thoughts will so obviously lead to results such as o3, and far beyond. Maybe a few more people will start to understand the trajectory we're on.


> No, they are an incredible breakthrough, because extending them with things like internal thoughts will so obviously lead to results such as o3, and far beyond.

While I agree that the LLM progress as of late is interesting, the rest of your sentiment sounds more like you are in a cult.

As long as your field keeps coming up with less and less realistic predictions and fails to deliver over and over, eventually even the most gullible will lose faith in you.

Because that's what this all is right now. Faith.

> Maybe a few more people will start to understand the trajectory we're on.

All you are saying is that you believe something will happen in the future.

We can't have an intelligent discussion under those premises.

It's depressing to see so many otherwise smart people fall for their own hype train. You are only helping rich people get more rich by spreading their lies.


I know I'm at fault for emotively complaining about "uninformed takes" in my comment instead of being substantive, which I regret, and I deserve replies such as this. I'll try harder to avoid getting into these arguments next time.

I wouldn't be an AI researcher if I didn't have "faith" that AI as a goal is worthwhile and achievable and I can make progress. You think this is irrational?

I am actually working to improve the SoTA in mathematical reasoning. I have documents full of concrete ideas for how to do that. So does everyone else in AI, in their niche. We are in an era of low hanging fruit enabled by ML breakthroughs such as large-scale transformers. I'm not someone who thinks you can simply keep scaling up transformers to solve AI. But consider System 1 and System 2 thinking: System 1 sure looks solved right now.

> As long as your field keep coming with less and less realistic predictions and fail to deliver over and over

I don't think we're commenting on the same article here. For example, FrontierMath was expected to be near impossible for LLMs for years, now here we are 5 weeks later at 25%.


a trickle of people, sure, but most people never accidentally stumble upon good evaluation skills, let alone reason themselves to that level, so i don't see how most people will have the semblance of an idea of a realistic trajectory of ai progress. i think most people have very little conceptualization of their own thinking/cognitive patterns, at least not enough to sensibly extrapolate it onto ai.

doesn't help that most people are just mimics when talking about stuff that's outside their expertise.

Hell, my cousin, a quality-college-educated individual with high social/emotional IQ, will go down the conspiracy-theory rabbit hole so quickly based on some baseless crap printed on the internet. Then he'll talk about people being satan worshipers.


You're being pretty harsh, but:

> i think most people have very little conceptualization of their own thinking/cognitive patterns, at least not enough to sensibly extrapolate it onto ai.

Quite true. If you spend a lot of time reading and thinking about the workings of the mind you lose sight of how alien it is to intuition. While in highschool I first read, in New Scientist, the theory that conscious thought lags behind the underlying subconscious processing in the brain. I was shocked that New Scientist would print something so unbelievable. Yet there seemed to be an element of truth to it so I kept thinking about it and slowly changed my assessment.


sorry, humans are stupid and what intelligence they have is largely impotent. if this wasn't the case, life wouldn't be this dystopia. my crassness comes not from trying to pick on a particular group of humans, but from disappointment in recognizing how limited the efficacy of human intelligence is at turning reality into a better reality (meh).

yeah, i was just thinking how a lot of thoughts which i thought were my original thoughts really were made possible by communal thoughts. like i can maybe have some original frontier thoughts that involve averages, but that's only made possible because some other person invented the abstraction of averages, and then that was collectively disseminated to everyone in education, not to mention all the subconscious processes that are necessary for me to will certain thoughts into existence. makes me reflect on how much cognition is really mine, vs (not mine) an inevitable product of a deterministic process and a product of other humans.


> only made possible because some other person invented the abstraction of averages then that was collectively disseminated to everyone in education

What I find most fascinating about the history of mathematics is that basic concepts such as zero and negative numbers and graphs of functions, which are so easy to teach to students, required so many mathematicians over so many centuries. E.g. Newton figured out calculus because he gave so much thought to the works of Descartes.

Yes, I think "new" ideas (meaning, a particular synthesis of existing ones) are essentially inevitable, and how many people come up with them, and how soon, is a function of how common those prerequisites are.


Sounds like your cousin is able to think for himself. The amount of bullshit I hear from quality-college educated individuals, who simply repeat outdated knowledge that is in their college curriculum, is no less disappointing.


Buying whatever bullshit you see on the internet to such a degree that you're re-enacting satanic panic from the 80s is not "thinking for yourself". It's being gullible about areas outside your expertise.


Reflection isn't a new concept, but a) actually proving that it's an effective tool for these types of models and b) finding an effective method for reflection that doesn't just lock you into circular "thinking" were the hard parts, and hence the "breakthrough".

It’s very easy to say hey ofc it’s obvious but there is nothing obvious about it because you are anthropomorphizing these models and then using that bias after the fact as a proof of your conjecture.

This isn’t how real progress is achieved.


Calling it reflection is, for me, further anthropomorphizing. However I am in violent agreement that a common feature of llm debate is centered around anthropomorphism leading to claims of "thinking longer" or "reflecting" when none of those things are happening.

The state of the art seems very focused on promoting the idea that language which might encode reasoning is as good as actual reasoning, rather than asking what a reasoning model might look like.


I didn't name it; to me it's more about reflecting the output back on itself, which doesn't necessarily mean anthropomorphism.


> ~doubling every 2-2.5 years) puts us at 20~25 years.

The trend for power consumption of compute (Megaflops per watt) has generally tracked with Koomey’s law for a doubling every 1.57 years

Then you also have model performance improving with compression. For example, Llama 3.1’s 8B outperforming the original Llama 65B
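
Under that cadence, closing the rough 10^3 cost gap discussed upthread from efficiency gains alone would take on the order of:

    // Years to close a ~1000x efficiency gap if compute efficiency doubles
    // every 1.57 years (Koomey's law cadence) -- a sketch, ignoring algorithmic gains.
    const years = Math.log2(1000) * 1.57;
    console.log(years.toFixed(1)); // ~15.6 years

Noticeably shorter than the 20-25 year figure implied by a 2-2.5 year GPU doubling, before counting any of the model-compression gains mentioned above.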


Then you will just have the issue of supplying enough power to support this "linear" growth of yours.


who in this field is anticipating the impact of near-AGI on society? maybe i'm too anxious, but not planning for a potentially workless life seems dangerous (but maybe i'm just not following the right groups)


AGI would have a major impact on human work. Currently the hype is much greater than the reality. But it looks like we are starting to see some of the components of an AGI and that is cause for discussion of impact, but not panicked discussion. Even the chatbot customer service has to be trained on the domain. Still it is most useful in a few specific ways:

Routing to the correct human support

Providing FAQ level responses to the most common problems.

Providing a second opinion to the human taking the call.

So, even this most relevant domain for the technology doesn't eliminate human employment (because it's just not flexible or reliable enough yet).


Don't forget that humans (real GI) paired with increasingly capable AI can create a feedback loop that accelerates new advances.


> are we stuck waiting for the 20-25 years for GPU improvements

If this turns out to be hard to optimize / improve then there will be a huge economic incentive for efficient ASICs. No freaking way we’ll be running on GPUs for 20-25 years, or even 2.


LLMs need efficient matrix multipliers. GPUs are specialized ASICs for massive matrix multiplication.


LLMs get to maybe ~20% of the rated max FLOPS on a GPU. It's not hard to imagine that a purpose-built ASIC with an adjusted software stack gets us significantly more real performance.


They get more than this. For prefill we can get 70% matmul utilization; for generation less than that, but we'll get to >50% too eventually.


And even when you get to 100% utilization you’ll still be wasting a crazy amount of gates / die area, plus you’re paying the Nvidia tax. There is no way in hell that will go on for 10 years if we have good AGI but inference is too expensive.


Maybe another plane with a bunch of semiconductor people will disappear over Kazakhstan or something. Capitalist communism gets bossier in stealth mode.

But sorry, blablabla, this shit is getting embarrassing.

> The question is now, can we close this "to human" gap

You won’t close this gap by throwing more compute at it. Anything in the sphere of creative thinking eludes most people in the history of the planet. People with PhDs in STEM end up working in IT sales not because they are good or capable of learning but because more than half of them can’t do squat shit, despite all that compute and all those algorithms in their brains.


> Super exciting that OpenAI pushed the compute out this far

it's even more exciting than that. the fact that you even can use more compute to get more intelligence is a breakthrough. if they spent even more on inference, would they get even better scores on arc agi?


> the fact that you even can use more compute to get more intelligence is a breakthrough.

I'm not so sure—what they're doing by just throwing more tokens at it is similar to "solving" the traveling salesman problem by just throwing tons of compute into a breadth first search. Sure, you can get better and better answers the more compute you throw at it (with diminishing returns), but is that really that surprising to anyone who's been following tree of thought models?

All it really seems to tell us is that the type of model that OpenAI has available is capable of solving many of the types of problems that ARC-AGI-PUB has set up given enough compute time. It says nothing about "intelligence" as the concept exists in most people's heads—it just means that a certain very artificial (and intentionally easy for humans) class of problem that wasn't computable is now computable if you're willing to pay an enormous sum to do it. A breakthrough of sorts, sure, but not a surprising one given what we've seen already.


An algorithm designed for translating between human languages has now been shown to generalize to solving visual IQ test puzzles, without much modification.

Yes, I find that surprising.


Maybe it's not linear spend.


I don't think this is only about efficiency. The model I have here is that this is similar to when we beat chess. Yes, it is impressive that we made progress on a class of problems, but is this class aligned with what the economy or the society needs?

Simple turn-based games such as chess turned out to be too far away from anything practical and chess-engine-like programs were never that useful. It is entirely possible that this will end up in a similar situation. ARC-like pattern matching problems or programming challenges are indeed a respectable challenge for AI, but do we need a program that is able to solve them? How often does something like that come up really? I can see some time-saving in using AI vs StackOverflow in solving some programming challenges, but is there more to this?


I mostly agree with your analysis, but just to drive home a point here - I don't think that algorithms to beat Chess were ever seriously considered as something that would be relevant outside of the context of Chess itself. And obviously, within the world of Chess, they are major breakthroughs.

In this case there is more reason to think these things are relevant outside of the direct context - these tests were specifically designed to see if AI can do general-thinking tasks. The benchmarks might be bad, but that's at least their purpose (unlike in Chess).


ARC is designed to be hard for current models. It cannot be a proxy for how useful they are; it says something else. Most likely those models won't replace humans at their tasks in their organizations. Instead, "we" will design pipelines so that the tasks align with the abilities of the model, and we will put the human at the periphery. Think of how a factory is organised for the robots.


okay, but what about literal SWE-bench? o3 scored 75% on that eval


I wonder if we'll start seeing a shift in compute spend, moving away from training time, and toward inference time instead. As we get closer to AGI, we probably reach some limit in terms of how smart the thing can get just training on existing docs or data or whatever. At some point it knows everything it'll ever know, no matter how much training compute you throw at it.

To move beyond that, the thing has to start thinking for itself, some auto feedback loop, training itself on its own thoughts. Interestingly, this could plausibly be vastly more efficient than training on external data because it's a much tighter feedback loop and a smaller dataset. So it's possible that "nearly AGI" leads to ASI pretty quickly and efficiently.

Of course it's also possible that the feedback loop, while efficient as a computation process, isn't efficient as a learning / reasoning / learning-how-to-reason process, and the thing, while as intelligent as a human, still barely competes with a worm in true reasoning ability.

Interesting times.


> I am interpreting this result as human level reasoning now costs (approximately) 41k/hr to 2.5M/hr with current compute.

On a very simple, toy task, which ARC-AGI basically is. ARC-AGI tests are not hard per se; LLMs just find them hard. We do not know how this scales to more complex, real-world tasks.


Right. Arc is meant to test the ability of a model to generalize. It's neat to see it succeed, but it's not yet a guarantee that it can generalize when given other tasks.

The other benchmarks are a good indication though.


> Arc is meant to test the ability of a model to generalize. It's neat to see it succeed, but it's not yet a guarantee that it can generalize when given other tasks.

Well no, that would mean that Arc isn't actually testing the ability of a model to generalize then and we would need a better test. Considering it's by François Chollet, yep we need a better test.


Does it mean anything for more general tasks like driving a car?


Is every smart person a good driver?


That kind of proves the point that no matter how smart it gets, it may still have several deficiencies in things that are crucial and trivial for humans. Is it generalizing to any task, or only to a specific set of tasks?


Likely yes. Every smart person is capable of being a good driver, so long as you give them enough training and incentive. Zero smart people are born being able to drive.


What about the archetype of the absent-minded genius? I've met several people who are shockingly intelligent but completely lose situational awareness on a regular basis.

And conversely, the world’s best drivers aren’t noted for being intellectual giants.

I don’t think driving skill and raw intelligence are that closely connected.


There are different kinds of smarts and not every smart person is good at all of them. Specifically, spatial reasoning is important for driving, and if a smart person is good at all kinds of thinking except that one, they're going to find it challenging to be a good driver.


Says the technical founder and CTO of our startup, who exited with 9 figures and who also has a severe lazy eye: "you don't want me driving." He got pulled over for suspected DUI; totally clean, he just can't drive straight.


> ~=$3400 per single task

The report says it is $17 per task, and $6k for the whole dataset of 400 tasks.


"Note: OpenAI has requested that we not publish the high-compute costs. The amount of compute was roughly 172x the low-compute configuration."

The low compute was $17 per task. Speculating 172 x $17 for the high compute gives $2,924 per task, so I am also confused about the $3,400 number.


The 3400 came from counting pixels on the plot.

Also, it's $20 per task for o3-low on the semi-private set via the table, which x172 is $3,440, also coming in close to the 3400 number.


That's the low-compute mode. In the plot at the top where they score 88%, O3 High (tuned) is ~3.4k


The low compute one did as well as the average person though


sorry to be a noob, but can someone tell me: does this mean o3 will be unaffordable for a typical user? Will only companies with thousands to spend per query be able to use this?

Sorry for being thick, I'm just confused how they can turn this into an affordable service.


There are likely many efficiency gains that will be made before it's released, and after. Also, they showed o3-mini to be better than o1 for less cost on multiple benchmarks, so there are already improvements there at a lower cost than what's available.


Great thank you


You're misreading it, there's two different runs, a low and a high compute run.

The number for the high-compute one is ~172x the first one according to the article so ~=$2900


What's extra confusing is that in the graph the runs are called low compute and high compute. In the table they're called high efficient and low efficiency. So the high and low got swapped.


That’s for the low-compute configuration that doesn’t reach human-level performance (not far though)


I was referring to high-compute mode. They have a table with the breakdown here: https://arcprize.org/blog/oai-o3-pub-breakthrough


The table row with the $6k figure refers to high-efficiency, not high-compute mode. From the blog post:

Note: OpenAI has requested that we not publish the high-compute costs. The amount of compute was roughly 172x the low-compute configuration.


That's "efficiency" high, which actually means less compute. The 87.5% score using low efficiency (more compute) doesn't have cost listed.


They use some confusing language: "High Efficiency" is O3 Low, and "Low Efficiency" is O3 High.

They left the "Low Efficiency" (O3 High) values as `-`, but you can infer them from the plot at the top.

Note that the $20 and $17 per task align with the X-axis of the O3-low points.


That's high EFFICIENCY. High efficiency = low compute.


Efficiency has always been the key.

Fundamentally it's a search through some enormous state space. Advancements are "tricks" that let us find useful subsets more efficiently.

Zooming way out, we have a bunch of social tricks, hardware tricks, and algorithmic tricks that have resulted in a super useful subset. It's not the subset that we want though, so the hunt continues.

Hopefully it doesn't require revising too much in the hardware & social bag of tricks; those are a lot more painful to revisit...


I am not so sure, but indeed it is perhaps also a sad realization.

You compare this to "a human" but also admit there is high variation.

And I would say there are a lot of humans being paid ~$3400 per month. Not for a single task, true, but honestly for no value-creating task at all. Just for their time.

So what if we think in terms of output rather than time?


Let's see when this will be released to the free tier. Looks promising, although I hope they will also publish more details on it, as part of the "open" in their name.


This is a beta version. By the time they're done with it, it'll be measured in single-digit dollars, if not cents.


I think the real key is figuring out how to turn the hand-wavy promises of this making everything better into policy long fucking before we kick the door open. It's self-evident that this being efficient and useful would be a technological revolution; what's not self-evident is that it wouldn't benefit the large corporate entities that control it even more disproportionately than it already does, to the detriment of many other people.


The programming task they gave o3-mini high (creating a Python server that allows chatting with the OpenAI API and running some code in the terminal) didn't seem very hard? Strange choice of example for something that's claimed to be a big step forward.

YT timestamped link: https://www.youtube.com/watch?v=SKBG1sqdyIU&t=768s (thanks for the fixed link @photonboom)

Updated: I gave the task to Claude 3.5 Sonnet and it worked first shot: https://claude.site/artifacts/36cecd49-0e0b-4a8c-befa-faa5aa...


Looks like quite shoddy code though. Like, the procedure for running a shell command is pure side-effect procedural code, returning neither the exit code of the command nor its output. Like the incomplete Stack Overflow answer it was probably trained on. It might do one job at a time, but once this stuff gets integrated into one coherent thing, one needs to rewrite lots of it to actually be composable.

Though, of course, one can argue that lots of human-written code is not much different from this.
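
By "composable" I just mean something in this direction (a rough Python sketch using the standard subprocess module; the function name is mine, not from the demo):

    import subprocess

    def run_command(cmd):
        """Run a command and return (exit_code, stdout, stderr) instead of only printing."""
        result = subprocess.run(cmd, capture_output=True, text=True)
        return result.returncode, result.stdout, result.stderr

    code, out, err = run_command(["echo", "hello"])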


Which code is shoddy? The Claude or o3-mini one? If you mean Claude, then have you checked the o3-mini one is better?


Youtube is currently blocking my VPN, can't watch it.


It's good that it works since if you ask GPT-4o to use the openai sdk it will often produce invalid and out of date code.


But they did use a prompt that included a full example of how to call their latest model and API!


I would say they didn't want to demo anything riskier, because if you are gonna use the output code live in a demo it may produce compile errors and then they'd look stupid trying to fix it live.


If it was a safe bet problem, then they should have said that. To me it looks like they faked excitement for something not exciting which lowers credibility of the whole presentation.


They actually did that the last time when they showed the apps integration. First try in Xcode didn't work.


Yeah I think that time it was ok because they were demoing the app function, but for this they are demoing the model smarts


Models are predictable at temperature 0. They might have tested the output beforehand.


Models in practice haven't been deterministic at 0 temperature, although nobody knows exactly why. Either hardware or software bugs.


We know exactly why: it is because floating point operations aren't associative, but the GPU scheduler assumes they are, and the scheduler isn't deterministic. Running the model in a strictly deterministic order hurts performance, so they don't do that.
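
The non-associativity part is easy to see even on a CPU (a toy Python illustration, nothing to do with actual GPU kernels):

    print((0.1 + 1e20) - 1e20)   # 0.0 -- the 0.1 is swallowed by the huge intermediate sum
    print(0.1 + (1e20 - 1e20))   # 0.1 -- same numbers, different grouping, different result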


Cool, thanks a lot for the explanation. Makes sense.


Sonnet isn't a "mini" sized model. Try it with Haiku.


How mini is o3-mini compared to Sonnet, and why does it matter whether it's mini or not? Isn't the point of the demo to show what's now possible that wasn't before?

4o is cheaper than o1-mini, so "mini" doesn't mean much for costs.


What? Is this what this is? Either this is a complete joke or we're missing something.

I've been doing similar stuff in Claude for months, and it's not that impressive when you see how limited they really are once you go beyond boilerplate.



Yeah I agree that wasn't particularly mind blowing to me and seems fairly in line with what existing SOTA models can do. Especially since they did it in steps. Maybe I'm missing something.


Congratulations to Francois Chollet on making the most interesting and challenging LLM benchmark so far.

A lot of people have criticized ARC as not being relevant or indicative of true reasoning, but I think it was exactly the right thing. The fact that scaled reasoning models are finally showing progress on ARC proves that what it measures really is relevant and important for reasoning.

It's obvious to everyone that these models can't perform as well as humans on everyday tasks despite blowout scores on the hardest tests we give to humans. Yet nobody could quantify exactly the ways the models were deficient. ARC is the best effort in that direction so far.

We don't need more "hard" benchmarks. What we need right now are "easy" benchmarks that these models nevertheless fail. I hope Francois has something good cooked up for ARC 2!


There is a benchmark, NovelQA, that LLMs don't dominate when it feels like they should. The benchmark is to read a novel and answer questions about it.

LLMs are below human performance, as of when I last looked, but it doesn't get much attention.

Once it is passed, I'd like to see one that is solving the mystery in a mystery book right before it's revealed.

We'd need unpublished mystery novels to use for that benchmark, but I think it gets at what I think of as reasoning.

https://novelqa.github.io/


NovelQA is a great one! I also like GSM-Symbolic -- a benchmark based on making _symbolic templates_ of quite easy questions, and sampling them repeatedly, varying things like which proper nouns are used, what order relevant details appear, how many irrelevant details (GSM-NoOp) and where they are in the question, things like that.

LLMs are far, _far_ below human on elementary problems, once you allow any variation and stop spoonfeeding perfectly phrased word problems. :)

https://machinelearning.apple.com/research/gsm-symbolic

https://arxiv.org/pdf/2410.05229

Paper came out in October, I don't think many have fully absorbed the implications.

It's hard to take any of the claims of "LLMs can do reasoning!" seriously, once you understand that simply changing what names are used in a 8th grade math word problem can have dramatic impact on the accuracy.
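
To make the template idea concrete, here's a toy sketch of the kind of thing I mean (the template text, names and distractor are made up, not taken from the paper):

    import random

    TEMPLATE = ("{name} picks {x} apples on Monday and {y} more on Tuesday. "
                "{distractor}How many apples does {name} have now?")

    def sample_problem():
        x, y = random.randint(2, 9), random.randint(2, 9)
        question = TEMPLATE.format(
            name=random.choice(["Sofia", "Liam", "Priya"]),
            x=x, y=y,
            # GSM-NoOp-style irrelevant detail that shouldn't change the answer
            distractor=random.choice(["", "Three of them are unusually small. "]),
        )
        return question, x + y

    print(sample_problem())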


Looks like it hasn't been updated for nearly a year, and I'm guessing Gemini 2.0 Flash with 2M context will simply crush it.


That's true. They don't have Claude 3.5 on there either. So maybe it's not relevant anymore, but I'm not sure.

If so, let's move on to the murder mysteries or more complex literary analysis.


> I'd like to see one that is solving the mystery in a mystery book right before it's revealed.

I would think this is not so good a bench. Authors do not write logically; they write for entertainment.


So I'm thinking of something like a locked-room mystery, where the idea is that it's solvable and the reader is given a chance to solve it.

The reason it seems like an interesting bench is that it's a puzzle presented in a long context. It's like testing whether an LLM is at a Sherlock Holmes level of world and motivation modelling.


That's an old leaderboard -- has no one checked any SOTA LLM in the last 8 months?


Does it work on short stories, but not novels? If so, then that's just a minor question of context length that should self-resolve over time.


The books fit in current long-context models, so it's not merely a context-size constraint, but the length is part of the issue, for sure.


Benchmark how? Is it good if the LLM can or can't solve it?


"The fact that scaled reasoning models are finally showing progress on ARC proves that what it measures really is relevant and important for reasoning."

Not sure I understand how this follows. The fact that a certain type of model does well on a certain benchmark means that the benchmark is relevant for a real-world reasoning? That doesn't make sense.


It shows objectively that the models are getting better at some form of reasoning, which is at least worth noting. Whether that improved reasoning is relevant for the real world is a different question.


It shows objectively that one model got better at this specific kind of weird puzzle that doesn't translate to anything because it is just a pointless pattern matching puzzle that can be trained for, just like anything else. In fact they specifically trained for it, they say so upfront.

It's like the modern equivalent of saying "oh when AI solves chess it'll be as smart as a person, so it's a good benchmark" and we all know how that nonsense went.


Hmm, you could be right, but you could also be very wrong. Jury's still out, so the next few years will be interesting.

Regarding the value of "pointless pattern matching" in particular, I would refer you to Douglas Hofstadter's discussion of Bongard problems starting on page 652 of _Godel, Escher, Bach_. Money quote: "I believe that the skill of solving Bongard [pattern recognition] problems lies very close to the core of 'pure' intelligence, if there is such a thing."


Well I certainly at least agree with that second part, the doubt if there is such a thing ;)

The problem with pattern matching of sequences and transformers as an architecture is that it's something they're explicitly designed to be good at with self attention. Translation is mainly matching patterns to equivalents in different languages, and continuing a piece of text is following a pattern that exists inside it. This is primarily why it's so hard to draw a line between what an LLM actually understands and what it just wings naturally through pattern memorization and why everything about them is so controversial.

Honestly I was really surprised that all models did so poorly on ARC in general thus far, since it really should be something they ought to be superhuman at from the get-go. Probably more of a problem that it's visual in concept than anything else.


It doesn't follow, faulty logic. The two are probably correlated though.


This emphasizes persons and a self-conceived victory narrative over the ground truth.

Models have regularly made progress on it, this is not new with the o-series.

Doing astoundingly well on it, and having a mutually shared PR interest with OpenAI in this instance, doesn't mean a pile of visual puzzles is actually AGI or some well thought out and designed benchmark of True Intelligence(tm). It's one type of visual puzzle.

I don't mean to be negative, but to inject a memento mori. The real story is that some guys got together and rode on Chollet's name with some visual puzzles from ye olde IQ test, and the deal was that Chollet then gets to show up and say it proves program synthesis is required for True Intelligence.

Getting this score is extremely impressive, but I don't assign more signal to it than to any other benchmark with some thought put into it.


Solving ARC doesn't mean we have AGI. Also o3 presumably isn't doing program synthesis, seemingly proving Francois wrong on that front. (Not sure I believe the speculation about o3's internals in the link.)

What I'm saying is the fact that as models are getting better at reasoning they are also scoring better on ARC proves that it is measuring something relating to reasoning. And nobody else has come up with a comparable benchmark that is so easy for humans and so hard for LLMs. Even today, let alone five years ago when ARC was released. ARC was visionary.


Your argumentation seems convincing, but I'd like to offer a competing narrative: any benchmark that is public becomes completely useless, because companies optimize for it - especially in AI, which depends on piles of money and needs some proof of progress.

That's why I have some private benchmarks, and I'm sorry to say that the transition from GPT-4 to o1 wasn't unambiguously a step forward (in some tasks yes, in some not).

On the other hand, private benchmarks are even less useful to the general public than the public ones, so we have to deal with what we have - but many of us just treat it as noise and don't give it much significance. Ultimately, the models should defend themselves by performing the tasks individual users want them to do.


Rather, any logic puzzle you post on the internet as something AIs are bad at is in the next round of training data, so AIs get better at that specific question. Not because AI companies are optimizing for a benchmark, but because they suck up everything.


ARC has two test sets that are not posted on the Internet. One is kept completely private and never shared. It is used when testing open source models and the models are run locally with no internet access. The other test set is used when testing closed source models that are only available as APIs. So it could be leaked in theory, but it is still not posted on the internet and can't be in any web crawls.

You could argue that the models can get an advantage by looking at the training set which is on the internet. But all of the tasks are unique and generalizing from the training set to the test set is the whole point of the benchmark. So it's not a serious objection.


Given the delivery mechanism for OpenAI, how do they actually keep it private?


> So it could be leaked in theory

That's why they have two test sets. But OpenAI has legally committed to not training on data passed to the API. I don't believe OpenAI would burn their reputation and risk legal action just to cheat on ARC. And what they've reported is not implausible IMO.


Yeah I'm sure the Microsoft-backed company headed by Mr. Worldcoin Altman whose sole mission statement so far has been to overhype every single product they released wouldn't dare cheat on one of these benchmarks that "prove" AGI (as they've been claiming since GPT-2).


> o3 presumably isn't doing program synthesis

I'd guess it's doing natural language procedural synthesis, the same way a human might (i.e. figuring out the sequence of steps needed to effect the transformation), but it may well be doing (sub-)solution verification by using the procedural description to generate code whose output can then be compared to the provided examples.

While OpenAI hasn't said exactly what the architecture of o1/o3 is, the gist of it is pretty clear - basically adding "tree" search and iteration on top of the underlying LLM, driven by some RL-based post-training that imparts generic problem-solving biases to the model. Maybe there is a separate model orchestrating the search and solution evaluation.

I think there are many tasks that are easy enough for humans but hard/impossible for these models - the ultimate one in terms of commercial value would be to take an "off the shelf" model and treat it as an intern/apprentice and teach it to become competent in an entire job it was never trained on. Have it participate in team meetings and communications, and become a drop-in replacement for a human performing that job (any job that can be performed remotely without a physical presence).
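
For what it's worth, the speculated recipe is roughly "sample candidate reasoning steps, score them, keep the best, repeat". A toy beam-search sketch of that shape (the generate/score functions here are random stand-ins, not anything OpenAI has described):

    import random

    def generate_steps(partial):
        # stand-in for sampling candidate next reasoning steps from an LLM
        return [partial + [f"step-{random.randint(0, 9)}"] for _ in range(3)]

    def score(partial):
        # stand-in for a learned verifier / reward model
        return random.random()

    def search(depth=4, beam=2):
        frontier = [[]]
        for _ in range(depth):
            candidates = [c for p in frontier for c in generate_steps(p)]
            frontier = sorted(candidates, key=score, reverse=True)[:beam]
        return frontier[0]

    print(search())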


> Solving ARC doesn't mean we have AGI. Also o3 presumably isn't doing program synthesis, seemingly proving Francois wrong on that front.

Agreed.

> And nobody else has come up with a comparable benchmark that is so easy for humans and so hard for LLMs.

? There's plenty.


I'd love to hear about more. Which ones are you thinking of?


- "Are You Human" https://arxiv.org/pdf/2410.09569 is designed to be directly on target, i.e. a cross-cutting set of questions that are easy for humans but challenging for LLMs, instead of one type of visual puzzle. Much better than ARC for the purpose you're looking for.

- SimpleBench https://simple-bench.com/ (similar to above; great landing page w/scores that show human / ai gap)

- PIQA (physical question answering, i.e. "how do i get a yolk out of a water bottle", common favorite of local llm enthusiasts in /r/localllama https://paperswithcode.com/dataset/piqa

- Berkeley Function-Calling (I prefer https://gorilla.cs.berkeley.edu/leaderboard.html)

AI search googled "llm benchmarks challenging for ai easy for humans", and "language model benchmarks that humans excel at but ai struggles with", and "tasks that are easy for humans but difficult for natural language ai".

It also mentioned Moravec's paradox as a known framing of this concept; I started going down that rabbit hole because the resources were fascinating, but had to hold back and submit this reply first. :)


Thanks for the pointers! I hadn't seen Are You Human. Looks like it's only two months old. Of course it is much easier to design a test specifically to thwart LLMs now that we have them. It seems to me that it is designed to exploit details of LLM structure like tokenizers (e.g. character counting tasks) rather than to provide any sort of general reasoning benchmark. As such it seems relatively straightforward to improve performance in ways that wouldn't necessarily represent progress in general reasoning. And today's LLMs are not nearly as far from human performance on the benchmark as they were on ARC for many years after it was released.

SimpleBench looks more interesting. Also less than two months old. It doesn't look as challenging for LLMs as ARC, since o1-preview and Sonnet 3.5 already got half of the human baseline score; they did much worse on ARC. But I like the direction!

PIQA is cool but not hard enough for LLMs.

I'm not sure Berkeley Function-Calling represents tasks that are "easy" for average humans. Maybe programmers could perform well on it. But I like ARC in part because the tasks do seem like they should be quite straightforward even for non-expert humans.

Moravec's paradox isn't a benchmark per se. I tend to believe that there is no real paradox and all we need is larger datasets to see the same scaling laws that we have for LLMs. I see good evidence in this direction: https://www.physicalintelligence.company/blog/pi0


> "I'm not sure Berkeley Function-Calling represents tasks that are easy for average humans. Maybe programmers could perform well on it."

Functions in this context are not programming function calls. In this context, "function calls" is a now-deprecated LLM API name for "parse the input into this JSON template." No programming experience needed. It's entity extraction by another name, except that would be harder: here, you're told up front exactly the set of entities to identify. :)
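
i.e. the task is essentially "fill this in from the user's sentence" (an illustrative example of my own, not the benchmark's actual schema):

    user_input = "Book me a flight from Oslo to Berlin on March 3rd"
    template = {"name": "book_flight",
                "arguments": {"origin": None, "destination": None, "date": None}}
    # the expected model output is just the filled-in template:
    expected = {"name": "book_flight",
                "arguments": {"origin": "Oslo", "destination": "Berlin", "date": "March 3rd"}}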

> "Moravec's paradox isn't a benchmark per se."

Yup! It's a paradox :)

> "Of course it is much easier to design a test specifically to thwart LLMs now that we have them"

Yes.

Though, I'm concerned a simple yes might be insufficient for illumination here.

It is a tautology (it's easier to design a test that $X fails when you have access to $X), and it's unlikely you meant to just share a tautology.

A potential unstated-but-maybe-intended-communication is "it was hard to come up with ARC before LLMs existed" --- LLMs existed in 2019 :)

If they didn't, a hacky way to come up with a test that's hard for the top AIs at the time, BERT-era, would be to use one type of visual puzzle.

If, for conversation's sake, we ignore that it is exactly one type of visual puzzle, and that it wasn't designed to be easy for humans, then we can engage with: "it's the only one that's easy for humans, but hard for LLMs" --- this was demonstrated as untrue as well.

I don't think I have much to contribute past that. Once we're at "it is a singular example of a benchmark that's easy for humans but nigh-impossible for LLMs, at least in 2019, and this required singular insight", there's just too much that's not even wrong, in the Pauli sense, and it's in a different universe from the original claims:

- "Congratulations to Francois Chollet on making the most interesting and challenging LLM benchmark so far."

- "A lot of people have criticized ARC as not being relevant or indicative of true reasoning...The fact that [o-series models show] progress on ARC proves that what it measures really is relevant and important for reasoning."

- "...nobody could quantify exactly the ways the models were deficient..."

- "What we need right now are "easy" benchmarks that these models nevertheless fail."


How long has SimpleBench been posted? Out of the first 6 questions at https://simple-bench.com/try-yourself, o1-pro got 5/6 right.

It was interesting to see how it failed on question 6: https://chatgpt.com/c/6765e70e-44b0-800b-97bd-928919f04fbe

Apparently LLMs do not consider global thermonuclear war to be all that big a deal, for better or worse.


Don't worry, I also got that wrong :) I thought her affair would be the biggest problem for John.


John was an ex, not her partner. Tricky.


Gaming the benchmarks usually needs to be considered first when evaluating new results.


I think gaming the benchmarks is encouraged in the ARC AGI context. If you look at the public test cases you'll see they test a ton of pretty abstract concepts - space, colour, basic laws of physics like gravity/magnetism, movement, identity and lots of other stuff (highly recommend exploring them). Getting an AI to do well at all, regardless of whether it was gamed or not, is the whole challenge!


Honestly, is gaming benchmarks actually a problem in this space, given that it still shows something useful? It just means we need more benchmarks, yeah? It really feels not unlike Kaggle competitions.

We do the same exact stuff with real people with programming challenges and such, where people just study common interview questions rather than learning the material holistically. And since we know that people game these interview-type questions, we can adjust the interview processes to minimize gamification... which itself leads to gamification, and back to step one. That's not an ideal feedback loop of course, but people still get jobs and churn out "productive work" out of it.


AI are very good at gaming benchmarks. Both as overfitting and as Goodhart's law, gaming benchmarks has been a core problem during training for as long as I've been interested in the field.

Sometimes this manifests as "outside the box thinking", like how a genetic algorithm got an "oscillator" which was really just an antenna.

It is a hard problem, and yes we still both need and can make more and better benchmarks; but it's still a problem because it means the benchmarks we do have are overstating competence.


The idea behind this particular benchmark, at least, is that it can't be gamed. What are some ways to game ARC-AGI, meaning to pass it without developing the required internal model and insights?

In principle you can't optimize specifically for ARC-AGI, train against it, or overfit to it, because only a few of the puzzles are publicly disclosed.

Whether it lives up to that goal, I don't know, but their approach sounded good when I first heard about it.


Well, with billions in funding you could task a hundred or so very well paid researchers to do their best at reverse engineering the general thought process which went into ARC-AGI, and then generate fresh training data and labeled CoTs until the numbers go up.


Right, but the ARC-AGI people would counter by saying they're welcome to do just that. In doing so -- again in their view -- the researchers would create a model that could be considered capable of AGI.

I spent a couple of hours looking at the publicly-available puzzles, and was really impressed at how much room for creativity the format provides. Supposedly the puzzles are "easy for humans," but some of them were not... at least not for me.

(It did occur to me that a better test of AGI might be the ability to generate new, innovative ARC-AGI puzzles.)


It's tricky to judge the difficulty of these sorts of things. Eg, breadth of possibilities isn't an automatic sign of difficulty. I imagine the space of programming problems permits as much variety as ARC-AGI, but since we're more familiar with problems presented as natural language descriptions of programming tasks, and since we know there's tons of relevant text on the web, we see the abstract pictographic ARC-AGI tasks as more novel, challenging, etc. But, to an LLM, any task we can conceive of will be (roughly) as familiar as the amount of relevant training data it's seen. It's legitimately hard to internalize this.

For a space of tasks which are well-suited to programmatic generation, as ARC-AGI is by design, if we can do a decent job of reverse engineering the underlying problem generating grammar, then we can make an LLM as familiar with the task as we're willing to spend on compute.

To be clear, I'm not saying solving these sorts of tasks is unimpressive. I'm saying that I find it unsuprising (in light of past results) and not that strong of a signal about further progress towards the singularity, or FOOM, or whatever. For any of these closed-ish domain tasks, I feel a bit like they're solving Go for the umpteenth time. We now know that if you collect enough relevant training data and train a big enough model with enough GPUs, the training loss will go down and you'll probably get solid performance on the test set. Trillions of reasonably diverse training tokens buys you a lot of generalization. Ie, supervised learning works. This is the horse Ilya Sutskever's ridden to many glorious victories and the big driver of OpenAI's success -- a firm belief that other folks were leaving A LOT of performance on the table due to a lack of belief in the power of their own inventions.


We're in agreement!

What's endlessly interesting to me with all of this is how surprisingly quick the benchmarking feedback loops have become plus the level of scrutiny each one receives. We (as a culture/society/whatever) don't really treat human benchmarking criteria with the same scrutiny such that feedback loops are useful and lead to productive changes to the benchmarking system itself. So from that POV it feels like substantial progress continues to be made through these benchmarks.


I won't be as brutal in my wording, but I agree with the sentiment. This was something drilled into me as someone with a hobby in PC Gaming and Photography: benchmarks, while handy measures of potential capabilities, are not guarantees of real world performance. Very few PC gamers completely reinstall the OS before benchmarking to remove all potential cruft or performance impacts, just as very few photographers exclusively take photos of test materials.

While I appreciate the benchmark and its goals (not to mention the puzzles - I quite enjoy figuring them out), successfully passing this benchmark does not demonstrate or guarantee real world capabilities or performance. This is why I increasingly side-eye this field and its obsession with constantly passing benchmarks and then moving the goal posts to a newer, harder benchmark that claims to be a better simulation of human capabilities than the last one: it reeks of squandered capital and a lack of a viable/profitable product, at least to my sniff test. Rather than simply capitalize on their actual accomplishments (which LLMs are - natural language interaction is huge!), they're trying to prove to Capital that with a few (hundred) billion more in investments, they can make AGI out of this and replace all those expensive humans.

They've built the most advanced prediction engines ever conceived, and insist they're best used to replace labor. I'm not sure how they reached that conclusion, but considering even their own models refute this use case for LLMs, I doubt their execution ability on that lofty promise.


100%. The hype is misguided. I doubt half the people excited about the result have even looked at what the benchmark is.


Highly challenging for LLMs because it has nothing to do with language. LLMs and their training processes have all kinds of optimizations for language and how it's presented.

This benchmark has done a wonderful job with marketing by picking a great name. It's largely irrelevant for LLMs despite the fact it's difficult.

Consider how much of the model is just noise for a task like this given the low amount of information in each token and the high embedding dimensions used in LLMs.


The benchmark is designed to test for AGI and intelligence, specifically the ability to solve novel problems.

If the hypothesis is that LLMs are the “computer” that drives the AGI then of course the benchmark is relevant in testing for AGI.

I don't think you understand the benchmark and its motivation. ARC AGI benchmark problems are extremely easy and simple for humans, but LLMs fail spectacularly at them. Why they fail is irrelevant; the fact that they fail, though, means that we don't have AGI.


> The benchmark is designed to test for AGI and intelligence, specifically the ability to solve novel problems.

It's a bunch of visual puzzles. They aren't a test for AGI because they're not general. If models (or any other system, for that matter) could solve them, we'd be saying "this is a stupid puzzle, it has no practical significance". It's a test of some sort of specific intelligence. On top of that, the vast majority of blind people would fail - are they not generally intelligent?

The name is marketing hype.

The benchmark could be called "random puzzles LLMs are not good at because they haven't been optimized for them, because it's not a valuable thing to benchmark". Sure, it wasn't designed for LLMs, but throwing LLMs at it and saying "see?" is dumb. We can throw in benchmarks for tennis playing, chess playing, video game playing, car driving and a bajillion other things while we're at it.


And all that is kind of irrelevant, because if LLMs were human-level general intelligence, they would solve all these questions correctly without blinking.

But they don't. Not even the best ones.


No human would score high on that puzzle if the images were given to them as a series of tokens. Even previous LLMs scored much better than humans if tested in the same way.
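
For context, this is roughly what the model actually "sees" once a grid is flattened to text (a made-up 3x3 example pair just to show the format; real tasks are larger):

    train_input  = [[0, 0, 1],
                    [0, 1, 0],
                    [1, 0, 0]]
    train_output = [[1, 0, 0],
                    [0, 1, 0],
                    [0, 0, 1]]
    print(f"input: {train_input}\noutput: {train_output}")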


And most humans wouldn't do well on maths problems if the input was given to them as binary. The reason that reversal isn't important is that tokens are an implementation detail of how an AI is meant to solve real-world problems that humans face, while no one cares about humans solving tokens.


Humans communicate with each other to get things done. We have to think carefully how we communicate with each other given the shortcomings of humans and shortcomings of different communication mediums.

The fact that we might need to be mindful of how we communicate with a person/system/whatever doesn't mean too much in the context of AI. Just like humans, the details of how they work will need to be considered, and the standard trope of "that's an implementation detail" won't work.


> making the most interesting and challenging LLM benchmark so far.

This[1] is currently the most challenging benchmark. I would like to see how o3 handles it, as o1 solved only 1%.

1. https://epoch.ai/frontiermath/the-benchmark


Apparently o3 scored about 25%

https://youtu.be/SKBG1sqdyIU?t=4m40s


This is actually the result that I find way more impressive. Elite mathematicians think these problems are challenging and thought they were years away from being solvable by AI.


You're right, I was wrong to say "most challenging" as there have been harder ones coming out recently. I think the correct statement would be "most challenging long-standing benchmark" as I don't believe any other test designed in 2019 has resisted progress for so long. FrontierMath is only a month old. And of course the real key feature of ARC is that it is easy for humans. FrontierMath is (intentionally) not.


They should put some famous, unsolved problems in the next edition so ML researchers do some actually useful work while they're "gaming" the benchmarks :)


I'm certain that the big labs will be gunning for the Millennium Prize problems.


I liked the SimpleQA benchmark that measures hallucinations. OpenAI models did surprisingly poorly, even o1. In fact, it looks like OpenAI often does well on benchmarks by taking the shortcut to be more risk prone than both Anthropic and Google.


Because LLMs are on an off-ramp path towards AGI. A generally intelligent system can brute force its way with just memory.

Once a model recognizes a weakness through reasoning with CoT when posed a certain problem, and gets the agency to adapt to solve that problem, that's a precursor towards real AGI capability!


It's the least interesting benchmark for language models among all they've released, especially now that we already had a large jump in its best scores this year. It might be more useful as a multimodal reasoning task since it clearly involves visual elements, but with o3 already performing so well, this has proven unnecessary. ARC-AGI served a very specific purpose well: showcasing tasks where humans easily outperformed language models, so these simple puzzles had their uses. But tasks like proving math theorems or programming are far more impactful.


ARC wasn't designed as a benchmark for LLMs, and it doesn't make much sense to compare them on it since it's the wrong modality. Even a multimodal model with image inputs can't be expected to do well, since the tasks are nothing like 99.999% of the training data. The fact that even a text-only LLM can solve ARC problems with the proper framework is important, however.


> The fact that scaled reasoning models are finally showing progress on ARC proves that what it measures really is relevant and important for reasoning.

One might also interpret that as "the fact that models which are studying to the test are getting better at the test" (Goodhart's law), not that they're actually reasoning.


Are there any single-step non-reasoner models that do well on this benchmark?

I wonder how well the latest Claude 3.5 Sonnet does on this benchmark and if it's near o1.


    | Name                                 | Semi-private eval | Public eval |
    |--------------------------------------|-------------------|-------------|
    | Jeremy Berman                        | 53.6%             | 58.5%       |
    | Akyürek et al.                       | 47.5%             | 62.8%       |
    | Ryan Greenblatt                      | 43%               | 42%         |
    | OpenAI o1-preview (pass@1)           | 18%               | 21%         |
    | Anthropic Claude 3.5 Sonnet (pass@1) | 14%               | 21%         |
    | OpenAI GPT-4o (pass@1)               | 5%                | 9%          |
    | Google Gemini 1.5 (pass@1)           | 4.5%              | 8%          |

https://arxiv.org/pdf/2412.04604


why is this missing the o1 release / o1 pro models? Would love to know how much better they are


This might be because they are referencing single-step models, and I do not think o1 is single-step.


Akyürek et al uses test-time compute.


Here are the results for base models[1]:

  o3 (coming soon)  75.7% 82.8%
  o1-preview        18%   21%
  Claude 3.5 Sonnet 14%   21%
  GPT-4o            5%    9%
  Gemini 1.5        4.5%  8%
Score (semi-private eval) / Score (public eval)

[1]: https://arcprize.org/2024-results


It's easy to miss, but if you look closely at the first sentence of the announcement they mention that they used a version of o3 trained on a public dataset of ARC-AGI, so technically it doesn't belong on this list.


It's all a scam. ClosedAI trained on the data they were tested on, so no, nothing here is impressive.


Just a clarification, they tuned on the public training dataset, not the semi-private one. The 87.5% score was on the semi-private eval, which means the model was still able to generalize well.

That being said, the fact that this is not a "raw" base model, but one tuned on the ARC-AGI test distribution, takes away from the impressiveness of the result. How much? I'm not sure; we'd need the un-tuned base o3 model's score for that.

In the meantime, comparing this tuned o3 model to other un-tuned base models is unfair (apples-to-oranges kind of comparison).


They definitely did, or they probably did? Is there any source for that, just so I can point it out to people?


I'd love to know how Claude 3.5 Sonnet does so well despite (presumably) not having the same tricks as the o-series models.


i am confused cause this dataset is visual-based, and yet it's being used to measure 'LLMs'. I feel like the visual nature of it was really the biggest hurdle to solving it.


Human performance is 85% [1]. o3 high gets 87.5%.

This means we have an algorithm to get to human level performance on this task.

If you think this task is an eval of general reasoning ability, we have an algorithm for that now.

There's a lot of work ahead to generalize o3 performance to all domains. I think this explains why many researchers feel AGI is within reach, now that we have an algorithm that works.

Congrats to both Francois Chollet for developing this compelling eval, and to the researchers who saturated it!

[1] https://x.com/SmokeAwayyy/status/1870171624403808366, https://arxiv.org/html/2409.01374v1


As excited as I am by this, I still feel like this is still just a small approximation of a small chunk of human reasoning ability at large. o3 (and whatever comes next) feels to me like it will head down the path of being a reasoning coprocessor for various tasks.

But, still, this is incredibly impressive.


Which parts of reasoning do you think are missing? I do feel like it covers a lot of 'reasoning' ground despite its on-the-surface simplicity.


My personal 5 cents is that reasoning will be there when an LLM gives you some kind of outcome and then, when questioned about it, can explain every bit of the result it produced.

For example, if we asked an LLM to produce an image of a "human woman photorealistic" it produces result. After that you should be able to ask it "tell me about its background" and it should be able to explain "Since user didn't specify background in the query I randomly decided to draw her standing in front of a fantasy background of Amsterdam iconic houses. Usually Amsterdam houses are 3 stories tall, attached to each other and 10 meters wide. Amsterdam houses usually have cranes on the top floor, which help to bring goods to the top floor since doors are too narrow for any object wider than 1m. The woman stands in front of the houses approximately 25 meters in front of them. She is 1,59m tall, which gives us correct perspective. It is 11:16am of August 22nd which I used to calculate correct position of the sun and align all shadows according to projected lighting conditions. The color of her skin is set at RGB:xxxxxx randomly" etc.

And it is not too much to ask of LLMs. LLMs have access to all the information above, as they have read the whole internet. So there is definitely a description of Amsterdam architecture, of what a human body looks like, and of how to correctly estimate the time of day based on shadows (and vice versa). The only thing missing is the logic that connects all this information and is applied correctly to generate the final image.

I like to think of LLMs as fancy genius compression engines. They took all the information on the internet, compressed it, and are able to cleverly query that information for the end user. It is a tremendously valuable thing, but whether intelligence emerges out of it - not sure. Digital information doesn't necessarily contain everything needed to understand how it was generated and why.


I see two approaches for explaining the outcome: 1. Reasoning back over the result and justifying it. 2. Explainability - somehow justifying it by looking at which neurons were activated. The first could lead to lying, e.g. think of a high schooler explaining copied homework. The second does indeed access the paths influencing the decision, but it is a hard task due to the inherent way neural networks work.


> if we asked an LLM to produce an image of a "human woman photorealistic" it produces result

Large language models don't do that. You'd want an image model.

Or did you mean "multi-model AI system" rather than "LLM"?


It might be possible for a language model to paint a photorealistic picture though.


It is not.

You are confusing LLMs with Generative AI.


No, I'm not confusing it. I realize that LLMs sometimes connect with diffusion models to produce images. I'm talking about language models actually describing pixel data of the image.


Can an LLM use tools like humans do? Could it use an image model as a tool to query the image?


No, an LLM is a Large Language Model.

It can language.


You could teach it to emit patterns that (through other code) invoke tools, and loop the results back to the LLM.
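
A minimal sketch of that loop (call_llm is a placeholder that fakes the model's replies; not a real API):

    import json

    def call_llm(messages):
        # placeholder: ask for a tool once, then give a final answer
        if not any(m["role"] == "tool" for m in messages):
            return '{"tool": "add", "args": {"a": 2, "b": 3}}'
        return "The answer is 5."

    TOOLS = {"add": lambda a, b: a + b}
    messages = [{"role": "user", "content": "What is 2 + 3?"}]

    while True:
        reply = call_llm(messages)
        try:
            call = json.loads(reply)                      # tool-call pattern emitted by the model
            result = TOOLS[call["tool"]](**call["args"])
            messages.append({"role": "tool", "content": str(result)})
        except (json.JSONDecodeError, KeyError):          # plain text: treat as the final answer
            print(reply)
            break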


I think it's hard to enumerate the unknown, but I'd personally love to see how models like this perform on things like word problems where you introduce red herrings. Right now, LLMs at large tend to struggle mightily to understand when some of the given information is not only irrelevant, but may explicitly serve to distract from the real problem.


That’s not inability to reason though, that’s having a social context.

Humans also don’t tend to operate in a rigorously logical mode and understand that math word problems are an exception where the language may be adversarial: they’re trained for that special context in school. If you tell the LLM that social context, eg that language may be deceptive, their “mistakes” disappear.

What you’re actually measuring is the LLM defaults to assuming you misspoke trying to include relevant information rather than that you were trying to trick it — which is the social context you’d expect when trained on general chat interactions.

Establishing context in psychology is hard.


o1 already fixed the red herrings...


LLMs are still bound to a prompting session. They can't form long-term memories, can't ponder them, and can't develop experience. They have no cognitive architecture.

'Agents' (i.e. workflows intermingling code and calls to LLMs) are still a thing (as shown by the fact that there is a post by Anthropic on this subject on the front page right now) and they are very hard to build.

Consequence of that, for instance: it's not possible to have an LLM exhaustively explore a topic.


LLMs don't, but who said AGI should come from LLMs alone? When I ask ChatGPT about something "we" worked on months ago, it "remembers" and can continue the conversation with that history in mind.

I'd say humans are also bound to prompting sessions in that way.


Last time I used ChatGPT's 'memory' feature it got full very quickly. It remembered my name, my dog's name and a couple of tobacco casing recipes it came up with. OpenAI doesn't seem to be using embeddings and a vector database, just text snippets it injects into every conversation. Because RAG is too brittle? The same problem arises when composing LLM calls. Efficient and robust workflows are those whose prompts and/or DAG were obtained via optimization techniques. Hence DSPy.

Consider the following use case: keeping a swimming pool's water clean. I can have a long-running conversation with an LLM to guide me in getting it right. However, I can't have an LLM handle the problem autonomously. I'd like it to notify me on its own: "hey, it's been 2 days, any improvement? Do you mind sharing a few pictures of the pool as well as the pH/chlorine test results?" Nothing mind-bogglingly complex. Nothing that couldn't be achieved using current LLMs. But still something I'd have to implement myself, and which turns out to be more complex to achieve than expected. This is the kind of improvement I'd like to see big AI companies going after, rather than research-grade ultra-smart AIs.
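
The glue I have in mind is not much more than this (call_llm and notify are placeholders, not an existing API):

    import time

    def call_llm(prompt):
        # placeholder for any chat-completion call
        return "Hey, it's been 2 days -- any improvement? Mind sharing a few pool pictures and the pH/chlorine readings?"

    def notify(text):
        print(text)  # placeholder: push notification, email, etc.

    CHECK_EVERY_DAYS = 2
    while True:
        notify(call_llm("Draft a friendly follow-up asking for pool photos and pH/chlorine results."))
        time.sleep(CHECK_EVERY_DAYS * 24 * 3600)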


Optimal phenomenological reasoning is going to be a tough nut to crack.

Luckily we don't know the problem exists, so in a cultural/phenomenological sense it is already cracked.


Current AI is good at text but not very good at 3d physical stuff like fixing your plumbing.


Does it include the use of tools to accomplish a task?

Does it include the invention of tools?


kinda interesting, every single CS person (especially phds) when talking about reasoning is unable to concisely quantify, enumerate, qualify, or define reasoning.

people with (high) intelligence talking about and building (artificial) intelligence, but never able to convincingly explain aspects of intelligence. they just often talk ambiguously and circularly around it.

what are we humans getting ourselves into inventing skynet :wink.

its been an ongoing pet project of mine to tackle reasoning, but i cant answer your question with regards to llms.


>> Kinda interesting, every single CS person (especially phds) when talking about reasoning are unable to concisely quantify, enumerate, qualify, or define reasoning.

Kinda interesting that mathematicians also can't do the same for mathematics.

And yet.


Mathematicians absolutely can, it's called foundations, and people actively study what mathematics can be expressed in different foundations. Most mathematicians don't care about it though for the same reason most programmers don't care about Haskell.


I don't care about Haskell either, but we know what reasoning is [1]. It's been studied extensively in mathematics, computer science, psychology, cognitive science and AI, and in philosophy going back literally thousands of years with grandpapa Aristotle and his syllogisms. Formal reasoning, informal reasoning, non-monotonic reasoning, etc etc. Not only do we know what reasoning is, we know how to do it with computers just fine, too [2]. That's basically the first 50 years of AI, that folks like His Nobelist Eminence Geoffrey Hinton will tell you was all a Bad Idea and a total failure.

Still somehow the question keeps coming up- "what is reasoning". I'll be honest and say that I imagine it's mainly folks who skipped CS 101 because they were busy tweaking their neural nets who go around the web like Diogenes with his lantern, howling "Reasoning! I'm looking for a definition of Reasoning! What is Reasoning!".

I have never heard the people at the top echelons of AI and Deep Learning - LeCun, Schmidhuber, Bengio, Hinton, Ng, Hutter, etc etc - say things like that: "what's reasoning". The reason, I suppose, is that they know exactly what that is, because it was the one thing they could never do with their neural nets, that classical AI could do between sips of coffee at breakfast [3]. Those guys know exactly what their systems are missing and, to their credit, have made no bones about it.

_________________

[1] e.g. see my profile for a quick summary.

[2] See all of Russell & Norvig, as a for-instance.

[3] Schmidhuber's doctoral thesis was an implementation of genetic algorithms in Prolog, even.


i have a question for you, one i've asked many philosophy professors but none could answer satisfactorily. since you seem to have a penchant for reasoning, perhaps you might have a good answer. (i hope i remember the full extent of the question properly; i might hit you up with some follow-up questions)

it pertains to the source of the inference power of deductive inference. do you think all deductive reasoning originated inductively? like when someone discovers a rule or fact that seemingly has contextual predictive power, obviously that can be confirmed inductively by observations, but did that deductive reflex of the mind coagulate through inductive experiences? maybe not all derivative deductive rules, but the original deductive rules.


I'm sorry but I have no idea how to answer your question, which is indeed philosophical. You see, I'm not a philosopher, but a scientist. Science seeks to pose questions, and answer them; philosophy seeks to pose questions, and question them. Me, I like answers more than questions so I don't care about philosophy much.


well yeah, it's partially philosophical; i guess my haphazard use of language like “all” makes it more philosophical than intended.

but i'm getting at a few things. one of those things is neurological: how do deductive inference constructs manifest in neurons, and is it really, inadvertently, an inductive process that creates deductive neural functions?

the other aspect of the question i guess is more philosophical: why does deductive inference work at all? i think clues to a potential answer can be seen in the mechanics of generalization: antecedents predicting (or correlating with) certain generalized consequences consistently. the brain coagulates generalized coinciding concepts by reinforcement, and it recognizes or differentiates inclusive or excluding instances of a generalization by recognition properties that seem to gatekeep identities accordingly. it's hard to explain succinctly what i mean by the latter, but i'm planning on writing an academic paper on that.


I'm sorry, I don't have the background to opine on any of the subjects you discuss. Good luck with your paper!


>Those guys know exactly what their systems are missing

If they did not actually, would they (and you) necessarily be able to know?

Many people claim the ability to prove a negative, but no one will post their method.


To clarify, what neural nets are missing is a capability present in classical, logic-based and symbolic systems. That's the ability that we commonly call "reasoning". No need to prove any negatives. We just point to what classical systems are doing and ask whether a deep net can do that.


Do Humans have this ability called "reasoning"?


well lets just say i think i can explain reasoning better than anyone ive encountered. i have my own hypothesized theory on what it is and how it manifests in neural networks.

i doubt your mathematician example is equivalent.

examples that are fresh in my mind that further my point: i've heard yann lecun baffled by LLMs' instantiation/emergence of reasoning, along with other ai researchers. eric schmidt thinks agentic reasoning is the current frontier and people should be focusing on that. i was listening to the start of an ai/machine learning interview a week ago where some cs phd was asked to explain reasoning and the best he could muster was "you know it when you see it"... not to mention the guy responding to the grandparent gave a cop-out answer (all the most respect to him).


>> well lets just say i think i can explain reasoning better than anyone ive encountered. i have my own hypothesized theory on what it is and how it manifests in neural networks.

I'm going to bet you haven't encountered the right people then. Maybe your social circle is limited to folks like the person who presented a slide about A* to a dumbstruck roomful of Deep Learning researchers at the last NeurIPS?

https://x.com/rao2z/status/1867000627274059949


possibly, my university doesn’t really do ai research beyond using it as a tool to engineer things. im looking to transfer to a different university.

but no, my take on reasoning is really a somewhat generalized reframing of the definition of reasoning (which you might find in the Stanford Encyclopedia of Philosophy), reframed partially in axiomatic building blocks of neural network components/terminology. im not claiming to have discovered reasoning, just to redefine it in a way that's compatible with and sensible for neural networks (ish).


Well, you're free to define and redefine anything as you like, but be aware that every time you move the target closer to your shot you are setting yourself up for some pretty strong confirmation bias.


yeah, that's why i need help from the machine interpretability crowd, to make sure my hypothesized reframing of reasoning has sufficient empirical basis and isn't adrift in la-la land.


Care to enlighten us with your explanation of what "reasoning" is?


terribly sorry to be such a tease, but i'm looking to publish a paper on it, and still need to delve deeper into machine interpretability to make sure it's empirically properly couched. if you can help with that, perhaps we can continue this convo in private.


I'd like to see this o3 thing play 5d chess with multiverse time travel or baba is you.

The only effect smarter models will have is that intelligent people will have to use less of their brain to do their work. As has always been the case, the medium is the message, and climate change is one of the most difficult and worst problems of our time.

If this gets software people to quit en-masse and start working in energy, biology, ecology and preservation? Then it has succeeded.


> climate change is one of the most difficult and worst problems of our time.

Slightly surprised to see this view here.

I can think of half a dozen more serious problems off hand (e.g. population aging, institutional scar tissue, dysgenics, nuclear proliferation, pandemic risks, AI itself) along most axes I can think of (raw $ cost, QALYs, even X-risk).


None of those problems really matter if we don't have a planet to live on


You've been grievously misled if you think climate change could plausibly make the world uninhabitable in the next couple of centuries given current trajectories. I advise going to the primary sources and emailing a climate scientist at your local university for some references.


> going to the primary sources and emailing a climate scientist at your local university for some references

I assume you've done this, otherwise you wouldn't be telling me to? Bold of you to assume my ignorance on this subject. You sound like you've fallen for corporate grifters who care more about short-term profit and gains over long-term sustainability (or you are one of said grifters, in which case why are you wasting your time on HN, shouldn't you be out there grinding?!)

Severe weather events are going to get more common and more devastating over the next couple of decades. They'll come for you and people you care about, just as they come for me and people I care about. It doesn't matter what you think you know about it.


I've read some climate papers but haven't done the email thing (I should, but have not).

The IPCC summaries are a good read too.

Do you genuinely think severe weather events are going to be even amongst the top ten killers this century? If so, I do strongly advise emailing local uni climate scientist. (What's the worst that can happen? Heck, they might confirm your views!)

(In other circumstances I might go through the whole "what have you observed that has given you this belief?" thing, but in this case there is a simple and reliable check in the form of a 5 minute email)

... actually, I can do so on your behalf... would you like me to? The specific questions I would be asking unless told otherwise would be:

1. Probability of human extinction in the next century due to climate change.

2. Probability of more than 10% of human deaths being due to extreme weather.

3. Places to find good unbiased summaries of the likely effects of climate change.

Any others?


Please do! I would love for you to do this.

Would you be so kind to ask

1. Do you think a tornado has real probability of forming in north-western Europe, where historically there has never been one before? And what do you think are the chances of it being destructive in ways before unseen? (Think Netherlands, Belgium, Germany, ...)

2. How are the attractors (chaos theory) changing? Is it correct to say that, no, our weather prediction models are not going to be more accurate, all we can say is that weather is going to _change_ in all extremes? More intense storms. Colder winters. Hotter summers. Drier droughts.

3. What institution predicted the floods in Spain? Did anyone? Or was this completely unprecedented and a complete surprise?

I think these are my primary questions for now.


I don't think that humans will go extinct from climate change, but it will drastically change where we can comfortably live and will uproot our ability to make meaningful cultural and scientific progress.

In your comment above you mention:

> e.g. population aging, institutional scar tissue, dysgenics, nuclear proliferation, pandemic risks, AI itself

These are all intertwined with each other and with climate change. People are less likely to have kids if they don't think those kids will have a comfortable future. Nuclear war is more likely if countries are competing for less and less resources as we deplete the planet and need to increase food production. Habitat loss from deforestation leads to animals comingling where they normally wouldn't, leading to increased risk of disease spillover into humans.

You claim that somebody saying "climate change is one of the most difficult and worst problems of our time" is a take you're surprised to see here on HN, but I'm more surprised that you don't list it in what you consider important problems.


Still it's comparing average human level performance with best AI performance. Examples of things o3 failed at are insanely easy for humans.


You'd be surprised what the AVERAGE human fails to do that you think is easy. My mom can't fucking send an email without downloading a virus, and I have a coworker who believes beyond a shadow of a doubt that the world is flat.

The average human is a lot dumber than people on Hacker News and Reddit seem to realize; shit, the people on MTurk are likely smarter than the AVERAGE person.


Not being able to send an email or believing the world is flat is not a matter of intelligence; I'd rather say it's more about culture or how much schooling someone has had. Your mom or coworker can still do things instinctively that outperform every algorithm out there, and it's still unexplained how we do it. We still have no idea what intelligence is.


Yet the average human can drive a car a lot better than ChatGPT can, which shows that the way you frame "intelligence" dictates your conclusion about who is "intelligent".


Pretty sure a waymo car drives better than an average SF driver.


And how well would a Waymo car do in this challenge with the ARC-AGI datasets?


Waymo cannot handle poor weather at all; the average human can.

Being able to perform better than humans in a specific, constrained problem space is how every automation system has been developed.

While self-driving systems are impressive, they don't drive with anywhere close to the skill of the average driver.


Waymo blog with video of them driving in poor weather https://waymo.com/blog/2019/08/waymo-and-weather


And Nikola famously made a video of a "working" truck that had no engine; we don't take a company's word for anything until we can verify it.

This is not offered to the public; they are actively expanding only in cities like LA, Miami, or Phoenix now, where the weather is good throughout the year.

The tech for bad weather is nowhere close to ready for the public. The average human, on the other hand, drives in bad weather every day.


"Extreme Weather" tech "will be available to riders in the near future" https://www.cnet.com/roadshow/news/waymos-latest-robotaxi-is...


I'm sure the source of that CNET article came with a forward looking statements disclaimer.


There's a reason why Waymo isn't offered in Buffalo.


Is that reason because Buffalo is the 81st most populated city in the United States, or 123rd by population density, and Waymo currently only serves approximately 3 cities in North America?

We already let computers control cars because they're better than humans at it when the weather is inclement. It's called ABS.


I would guess you haven't spent much time driving in the winter in the Northeast.

There is an inherent danger to driving in snow and ice. It is a PR nightmare waiting to happen because there is no way around accidents if the cars are on the road all the time in rust belt snow.


I get the feeling that the years I spent in Boston with a car, including during the winter and driving to Ithaca, somehow aren't enough, but whether or not I have is irrelevant. Still, I'll repeat the advice I was given: before you have to drive in snow, go practice driving in the snow (in e.g. a parking lot) before needing to do so, especially during a storm. Waymo's been spotted driving in Buffalo doing testing, so it seems someone gave them similar advice. https://www.wgrz.com/article/tech/waymo-self-driving-car-pro...

There's always an inherent risk to driving, even in sunny Phoenix, AZ. Winter dangers like black ice further multiply that risk, but humans still manage to drive in winter. Taking a picture/video of a snowed-over road, judging the width, and inventing lanes based on that width while taking snowbanks into account doesn't take an ML algorithm. Lidar can see black ice while human eyes cannot, giving cars equipped with lidar (whether driven by a human or a computer) an advantage over those without it, and Waymo cars currently have lidar.

I'm sure there are new challenges for Waymo to solve before deploying the service in Buffalo, but it's not the unforeseen gotcha the parent comment implies.

As far as the possible PR nightmare, you'd never do self-driving cars in the first place if you let that fear control you because, as you pointed out, driving on the roads is inherently dangerous with too many unforeseen complications.


If you take an electrical sensory input signal sequence and transform it into an electrical muscle output signal sequence, you've got a brain. ChatGPT isn't going to drive a car because it's trained on verbal tokens, and it's not optimized for the type of latency you need for physical interaction.

And the brain doesn't use the same network to do verbal reasoning as real time coordination either.

But that work is moving along fine. All of these models and lessons are going to be combined into AGI. It is happening. There isn't really that much in the way.


Maybe, but no doubt these "dumb" people can still get dressed in the morning, navigate a trip to the mall, do the dishes, etc, etc.

It's always been the case that the things that are easiest for humans are hardest for computers, and vice versa. Humans are good at general intelligence - tackling semi-novel problems all day long, while computers are good at narrow problems they can be trained on such as chess or math.

The majority of the benchmarks currently used to evaluate these AI models are narrow skills that the models have been trained to handle well. What'll be much more useful will be when they are capable of the generality of "dumb" tasks that a human can do.


Your examples are just examples of lack of information. That's not a measure for intelligence.

As a contrary point, most people think they are smarter than they really are.


There are things chimps do easily that humans fail at, and vice versa of course.

There are blind spots, doesn't take away from 'general'.


We can't agree whether Portia spiders are intelligent or just have very advanced instincts. How will we ever agree about what human intelligence is, or how to separate it from cultural knowledge? If that even makes sense.


I guess my point is more, if we can't decide about Portia Spiders or Chimps, then how can we be so certain about AI. So offering up Portia and Chimps as counter examples.


The downvotes should tell you, this is a decided "hype" result. Don't poo poo it, that's not allowed on AI slop posts on HN.


Yeah, I didn't realize Chimp studies, or neuroscience were out of vogue. Even in tech, people form strong 'beliefs' around what they think is happening.


What’s interesting is that it might be closer to human intelligence than to some “alien” intelligence, because after all it is an LLM trained on human-made text, which kind of represents human intelligence.


In that vein, perhaps the delta between o3 @ 87.5% and Human @ 85% represents a deficit in the ability of text to communicate human reasoning.

In other words, it's possible humans can reason better than o3, but cannot articulate that reasoning as well through text - only in our heads, or through some alternative medium.


It's possible humans reason better through text than not through text, so these models, having been trained on text, should be able to out-reason any person who's not currently sitting down to write.


I wonder how much of an effect amount of time to answer has on human performance.


Yeah, this is sort of meaningless without some idea of cost or consequences of a wrong answer. One of the nice things about working with a competent human is being able to tell them "all of our jobs are on the line" and knowing with certainty that they'll come to a good answer.


Agreed. I think what really makes them alien is everything else about them besides intelligence. Namely, no emotional/physiological grounding in empathy, shame, pride, and love (on the positive side) or hatred (negative side).


Human performance is much closer to 100% on this, depending on your human. It's easy to miss the dot in the corner of the headline graph in TFA that says "STEM grad."


A fair comparison might be average human. The average human isn't a STEM grad. It seems STEM grad approximately equals an IQ of 130. https://www.accommodationforstudents.com/student-blog/the-su...

From a post elsewhere the scores on ARC-AGI-PUB are approx average human 64%, o3 87%. https://news.ycombinator.com/item?id=42474659

Though also elsewhere, o3 seems very expensive to operate. You could probably hire a PhD researcher for cheaper.


Why would an average human be more fair than a trained human? The model is trained.


It's not saturated. 85% is average human performance, not "best human" performance. There is still room for the model to go up to 100% on this eval.


Curious about how many tests were performed. Did it consistently manage to successfully solve many of these types of problems?


NNs are not algorithms.


An algorithm is “a process or set of rules to be followed in calculations or other problem-solving operations, especially by a computer”

How does a giant pile of linear algebra not meet that definition?


It's not made of "steps"; it's an almost continuous function of its inputs. And a function is not an algorithm: it is not an object made of conditions, jumps, terminations, ... Obviously it has computation capabilities and is Turing-complete, but it is the opposite of an algorithm.


If it wasn’t made of steps then Turing machines wouldn’t be able to execute them.

Further, this is probably running an algorithm on top of an NN. Some kind of tree search.

I get what you’re saying though. You’re trying to draw a distinction between statistical methods and symbolic methods. Someday we will have an algorithm which uses statistical methods that can match human performance on most cognitive tasks, and it won’t look or act like a brain. In some sense that’s disappointing. We can build supersonic jets without fully understanding how birds fly.


Let's say rather that Turing machines can approximate the execution of an NN :) That's why there are issues related to numerical precision. The contrary is also true, indeed: NNs can discover and use techniques similar to those used by traditional algorithms. However, the two remain two different methods of doing computation, and it's probably not just by chance that many things we can't do algorithmically we can do with NNs. What I mean is that this is not just because NNs discover complex algorithms via gradient descent, but also because the computational model of NNs is better adapted to solving certain tasks. So the inference algorithm of NNs (doing multiplications and other batch transformations) is just what standard computers need in order to approximate the NN computational model. You could do this analogically, and nobody would claim (maybe?) that it's running an algorithm. Or that brains themselves are algorithms.


Computers can execute precise computations, it's just not efficient (and it's very much slow).

NNs are exactly what "computers" are good for and we've been using since their inception: doing lots of computations quickly.

"Analog neural networks" (brains) work much differently from what are "neural networks" in computing, and we have no understanding of their operation to claim they are or aren't algorithmic. But computing NNs are simply implementations of an algorithm.

Edit: upon further rereading, it seems you equate "neural networks" with brain-like operation. But brain was an inspiration for NNs, they are not an "approximation" of it.


But the inference itself is orthogonal to the computation the NN is doing. Obviously the inference (and training) are algorithms.


NN inference is an algorithm for computing an approximation of a function with a huge number of parameters. The NN itself is of course just a data structure. But there is nothing whatsoever about the NN process that is non-algorithmic.

It's the exact same thing as using a binary tree to discover the lowest number in some set of numbers, conceptually: you have a data structure that you evaluate using a particular algorithm. The combination of the algorithm and the construction of the data structure arrive at the desired outcome.


That's not the point, I think: you can implement the brain in BASIC, in theory, but this does not mean that the brain is per se a BASIC program. I'll provide a more theoretical framework for reasoning about this: if the way an NN solves certain problems (the learned weights) can't be translated into some normal program that DOES NOT resemble the activation of an NN, then NNs are not algorithms, but a different computational model.


This may be what they were getting at, but it is still wrong. An NN is a computable function. So, NN inference is an algorithm for computing the function the NN represents. If we have an NN that represents a function f, with f(text) = most likely next character a human would write, then running the inference for that NN is an algorithm for finding out which character a human is most likely to write next.

It's true that this is not an "enlightening" algorithm, it doesn't help us understand why or how that is the most likely next character. But this doesn't mean it's not an algorithm.


We don’t have evidence that a TM can simulate a brain. But we know for a fact that it can execute a NN.


> It's not made of "steps", it's an almost continuous function to its inputs.

Can you define "almost continuous function"? Or explain what you mean by this, and how it is used in the A.I. stuff?


Well, it's a bunch of steps, but they're smaller. /s


Each layer of the network is like a step, and each token prediction is a repeat of those layers with the previous output fed back into it. So you have steps and a memory.
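In toy numpy form (arbitrary shapes and weights, just to show the control flow, not any real model):

    # Toy sketch: each layer is a discrete step, and generation repeats
    # those steps with the previous output fed back in (the "memory").
    import numpy as np

    rng = np.random.default_rng(0)
    layers = [rng.standard_normal((8, 8)) for _ in range(4)]  # arbitrary weights

    def forward(x):
        for w in layers:            # one "step" per layer: multiply + nonlinearity
            x = np.tanh(w @ x)
        return x

    outputs = [rng.standard_normal(8)]   # arbitrary starting "token" vector
    for _ in range(5):                   # each prediction re-runs all the steps
        outputs.append(forward(outputs[-1]))

    print(len(outputs), "vectors produced, one per pass through the layer steps")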


I would say you are right that a function is not an algorithm, but it is an implementation of an algorithm.

Is that your point?

If so, I've long learned to accept imprecise language as long as the message can be reasonably extracted from it.


> continuous

So, steps?


"Continuous" would imply infinitely small steps, and as such, would certainly be used as a differentiator (differential? ;) between larger discrete stepped approach.

In essence, infinite calculus provides a link between "steps" and continuous, but those are different things indeed.


Deterministic (ieee 754 floats), terminates on all inputs, correctness (produces loss < X on N training/test inputs)

At most you can argue that there isn't a useful bounded loss on every possible input, but it turns out that humans don't achieve useful bounded loss on identifying arbitrary sets of pixels as a cat or whatever, either. Most problems NNs are aimed at are qualitative or probabilistic where provable bounds are less useful than Nth-percentile performance on real-world data.


How do you define "algorithm"? I suspect it is a definition I would find somewhat unusual. Not to say that I strictly disagree, but only because to my mind "neural net" suggests something a bit more concrete than "algorithm", so I might instead say that an artificial neural net is an implementation of an algorithm rather than an algorithm itself, or something like that.

But, to my mind, something of the form "Train a neural network with an architecture generally like [blah], with a training method+data like [bleh], and save the result. Then, when inputs are received, run them through the NN in such-and-such way." would constitute an algorithm.


NN is a very wide term applied in different contexts.

When a NN is trained, it produces a set of parameters that basically define an algorithm to do inference with: it's a very big one though.

We also call that a NN (the joy of natural language).


Running inference on a model certainly is an algorithm.


I’ll believe it when the AI can earn money on its own. I obviously don’t mean someone paying a subscription to use the AI; I mean letting the AI loose on the Internet with only the goal of making money and putting it into a bank account.


You don't think there are already plenty of attempts out there?

When someone is "disinterested enough" to publish though, note the obvious way to launch a new fund or advisor with a good track record: crank out a pile of them, run them one or two years, discard the many losers and publish the one or two top winners. I.E. first you should be suspicious of why it's being published, then of how selected that result is.


Do trading bots count?


No, the AI would have to start from zero and reason its way to making money online, like the humans who were first in their online field of interest (e-commerce, scams, ads, etc. from the 80s and 90s) when there was no guidance, only general human intelligence that could reason its way into money-making opportunities and into making them work.


I don't think humans ever do that. They research/read and ask other humans.


Which AI already has stored in spades, even more so since people in the 80s and 90s weren't working with the information available today. The AI is free to research and read all the information stored from other humans as well, just like the humans who reasoned their way into money-making opportunities--only with vastly more information now; talk about an advantage. But is it intelligent enough to do so without a human giving direct, step-by-step instructions, the way humans figure it out?


It actually beats the human average by a wide margin:

- 64.2% for humans vs. 82.8%+ for o3.

...

Private Eval:

- 85%: threshold for winning the prize [1]

Semi-Private Eval:

- 87.5%: o3 (unlimited compute) [2]

- 75.7%: o3 (limited compute) [2]

Public Eval:

- 91.5%: o3 (unlimited compute) [2]

- 82.8%: o3 (limited compute) [2]

- 64.2%: human average (Mechanical Turk) [1] [3]

Public Training:

- 76.2%: human average (Mechanical Turk) [1] [3]

...

References:

[1] https://arcprize.org/guide

[2] https://arcprize.org/blog/oai-o3-pub-breakthrough

[3] https://arxiv.org/abs/2409.01374


Superhuman isn't beating rando Mechanical Turk workers.

Their post has STEM grads at nearly 100%.


This is correct. It's easy to get arbitrarily bad results on Mechanical Turk, since without any quality control people will just click as fast as they can to get paid (or bot it and get paid even faster).

So in practice, there's always some kind of quality control. Stricter quality control will improve your results, and the right amount of quality control is subjective. This makes any assessment of human quality meaningless without explanation of how those humans were selected and incentivized. Chollet is careful to provide that, but many posters here are not.

In any case, the ensemble of task-specific, low-compute Kaggle solutions is reportedly also super-Turk, at 81%. I don't think anyone would call that AGI, since it's not general; but if the "(tuned)" in the figure means o3 was tuned specifically for these tasks, that's not obviously general either.


This is so strange. People think that an LLM trained on programming questions and docs being able to do mundane tasks like this means it's intelligent? Come on.

It really calls into question two things.

1. You don't know what you're talking about.

2. You have a perverse incentive to believe this such that you will preach it to others and elevate some job salary range or stock.

Either way, not a good look.


This


Let me go against some skeptics and explain why I think full o3 is pretty much AGI or at least embodies most essential aspects of AGI.

What has been lacking so far in frontier LLMs is the ability to reliably deal with the right level of abstraction for a given problem. Reasoning is useful but often comes out lacking if one cannot reason at the right level of abstraction. (Note that many humans can't either when they deal with unfamiliar domains, although that is not the case with these models.)

ARC has been challenging precisely because solving its problems often requires:

   1) using multiple different *kinds* of core knowledge [1], such as symmetry, counting, color, AND

   2) using the right level(s) of abstraction

Achieving human-level performance in the ARC benchmark, as well as top human performance in GPQA, Codeforces, AIME, and Frontier Math suggests the model can potentially solve any problem at the human level if it possesses essential knowledge about it. Yes, this includes out-of-distribution problems that most humans can solve.

It might not yet be able to generate highly novel theories, frameworks, or artifacts to the degree that Einstein, Grothendieck, or van Gogh could. But not many humans can either.

[1] https://www.harvardlds.org/wp-content/uploads/2017/01/Spelke...

ADDED:

Thanks to the link to Chollet's posts by lswainemoore below. I've analyzed some easy problems that o3 failed at. They involve spatial intelligence, including connection and movement. This skill is very hard to learn from textual and still image data.

I believe this sort of core knowledge is learnable through movement and interaction data in a simulated world and it will not present a very difficult barrier to cross. (OpenAI purchased a company behind a Minecraft clone a while ago. I've wondered if this is the purpose.)


Quote from the creators of the ARC-AGI benchmark: "Passing ARC-AGI does not equate to achieving AGI, and, as a matter of fact, I don't think o3 is AGI yet. o3 still fails on some very easy tasks, indicating fundamental differences with human intelligence."


I like the notion, implied in the article, that AGI won't be verified by any single benchmark, but by our collective inability to come up with benchmarks that defeat some eventual AI system. This matches the cat-and-mouse game we've been seeing for a while, where benchmarks have to constantly adapt to better models.

I guess you can say the same thing for the Turing Test. Simple chat bots beat it ages ago in specific settings, but the bar is much higher now that the average person is familiar with their limitations.

If/once we have an AGI, it will probably take weeks to months to really convince ourselves that it is one.


I'd need to see what kinds of easy tasks those are and would be happy to revise my hypothesis if that's warranted.

Also, it depends a great deal on what we define as AGI and whether they need to be a strict superset of typical human intelligence. o3's intelligence is probably superhuman in some aspects but inferior in others. We can find many humans who exhibit such tendencies as well. We'd probably say they think differently but would still call them generally intelligent.


They're in the original post. Also here: https://x.com/fchollet/status/1870172872641261979 / https://x.com/fchollet/status/1870173137234727219

Personally, I think it's fair to call them "very easy". If a person I otherwise thought was intelligent was unable to solve these, I'd be quite surprised.


Thanks! I've analyzed some easy problems that o3 failed at. They involve spatial intelligence including connection and movement. This skill is very hard to learn from textual and still image data.

I believe this sort of core knowledge is learnable through movement and interaction data in a simulated world and it will not present a very difficult barrier to cross.

(OpenAI purchased a company behind a Minecraft clone a while ago. I've wondered if this is the purpose.)


> I believe this sort of core knowledge is learnable through movement and interaction data in a simulated world and it will not present a very difficult barrier to cross.

Maybe! I suppose time will tell. That said, spatial intelligence (connection/movement included) is the whole game in this evaluation set. I think it's revealing that they can't handle these particular examples, and problematic for claims of AGI.


Probably just not trained on this kind of data. We could create a benchmark about it, and they'd shatter it within a year or so.

I'm starting to really see no limits on intelligence in these models.


Doesn't the fact that it can only accomplish tasks with benchmarks imply that it has limitations in intelligence?


> Doesn't the fact that it can only accomplish tasks with benchmarks

That's not a fact


> This skill is very hard to learn from textual and still image data.

I had the same take at first, but thinking about it again, I'm not quite sure?

Take the "blue dots make a cross" example (the second one). The inputs only has four blue dots, which makes it very easy to see a pattern even in text data: two of them have the same x coordinate, two of them have the same y (or the same first-tuple-element and second-tuple-element if you want to taboo any spatial concepts).

Then if you look into the output, you can notice that all the input coordinates are also in the output set, just not always with the same color. If you separate them into "input-and-output" and "output-only", you quickly notice that all of the output-only squares are blue and share a coordinate (tuple-element) with the blue inputs. If you split the "input-and-output" set into "same color" and "color changed", you can notice that the changes only go from red to blue, and that the coordinates that changed are clustered, and at least one element of the cluster shares a coordinate with a blue input.
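In rough Python terms (with tiny made-up grids, not the actual ARC task), the partitioning above is just set bookkeeping:

    # Partitioning an ARC-style example into the sets described above.
    # The tiny grids are invented for illustration, not the real task.
    inp = {(1, 3): "blue", (6, 3): "blue", (3, 1): "blue", (3, 5): "blue",
           (2, 3): "red"}
    out = {(1, 3): "blue", (6, 3): "blue", (3, 1): "blue", (3, 5): "blue",
           (2, 3): "blue", (4, 3): "blue", (3, 3): "blue"}

    in_and_out = {c: (inp[c], out[c]) for c in inp if c in out}
    output_only = {c: out[c] for c in out if c not in inp}

    changed = {c for c, (before, after) in in_and_out.items() if before != after}
    blue_inputs = [c for c, colour in inp.items() if colour == "blue"]

    # every output-only cell shares an x or y with some blue input cell
    assert all(any(c[0] == b[0] or c[1] == b[1] for b in blue_inputs)
               for c in output_only)
    print("changed:", changed, "output-only:", sorted(output_only))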

Of course, it's easy to build this chain of reasoning in retrospect, but it doesn't seem like a complete stretch: each step only requires noticing patterns in the data, and it's how a reasonably puzzle-savvy person might solve this if you didn't let them draw the squares on paper. There are a lot of escape games with chains of reasoning much more complex, and random office workers solve them all the time.

The visual aspect makes the patterns jump out at us more, but the fact that o3 couldn't find them at all with thousands of dollars of compute budget still seems meaningful to me.

EDIT: Actually, looking at Twitter discussions[1], o3 did find those patterns, but was stumped by ambiguity in the test input that the examples didn't cover. Its failures on the "cascading rectangles" example[2] looks much more interesting.

[1]: https://x.com/bio_bootloader/status/1870339297594786064

[2]: https://x.com/_AI30_/status/1870407853871419806


Yeah, the real goalpost is reliable intelligence. A supposedly PhD-level AI failing simple problems is a red flag that we're still missing something.


You've never met a doctor who couldn't figure out how to work their email? Or use street smarts? You can have a PhD but be unable to reliably handle soft skills, or any number of things you might 'expect' someone to be able to do.

Just playing devil's advocate or nitpicking the language a bit...


An important distinction here is you’re comparing skill across very different tasks.

I’m not even going that far, I’m talking about performance on similar tasks. Something many people have noticed about modern AI is it can go from genius to baby-level performance seemingly at random.

Take self driving cars for example, a reasonably intelligent human of sound mind and body would never accidentally mistake a concrete pillar for a road. Yet that happens with self-driving cars, and seemingly here with ARC-AGI problems which all have a similar flavor.


A coworker of mine has a PhD in physics. Showing him the difference between little- and big-endian in a hex editor, showing him the file sizes of raw image files and how to compute them... I explained it 3 times and maybe he understands part of it now.


Doctors[1] or, say, pilots are skilled professionals whose fields are difficult to master and deserve respect, yes, but they do not need high levels of intelligence to be good at them. They require many other skills that are hard, like making decisions under pressure or good motor skills, but not necessarily intelligence.

Also, not knowing something is hardly a criterion; skilled humans focus on their areas of interest above most other knowledge and can be unaware of other subjects.

Fields Medal winners, for example, may not be aware of most pop-culture things; that doesn't make them unable to be, just not interested.

—-

[1] Most doctors, including surgeons and many respected specialists; some doctors do need those skills, but they are a specialized few and generally do know how to use email.


Good nitpick.

A PhD learnt their field. If they learnt that field, reasoning through everything to understand their material, then - given enough time - they are capable of learning email and street smarts.

Which is why a reasoning LLM should be able to do all of those things.

It's not learnt a subject; it's learnt reasoning.


They say it isn't AGI, but I think the way o3 functions can be refined into AGI - it's learning to solve new, novel problems. We just need to make it do that more consistently, which seems achievable.


Have we really watered down the definition of AGI that much?

LLMs aren't really capable of "learning" anything outside their training data. Which I feel is a very basic and fundamental capability of humans.

Every new request thread is a blank slate utilizing whatever context you provide for the specific task, and after the thread is done (or the context limit runs out) it's like it never happened. Sure, you can use databases, do web queries, etc., but these are inflexible band-aid solutions, far from what's needed for AGI.


> LLMs aren't really capable of "learning" anything outside their training data.

ChatGPT has had for some time the feature of storing memories about its conversations with users. And you can use function calling to make this more generic.

I think drawing the boundary at “model + scaffolding” is more interesting.
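A rough sketch of what that function-calling route looks like; the tool schema below is generic and illustrative, not any particular vendor's exact format:

    # Sketch of "memory" as a tool the model can call; the scaffolding,
    # not the model, owns persistence. The schema is illustrative only.
    import json

    save_memory_tool = {
        "name": "save_memory",
        "description": "Persist a fact about the user for future sessions.",
        "parameters": {
            "type": "object",
            "properties": {"fact": {"type": "string"}},
            "required": ["fact"],
        },
    }

    memory_store = []  # in practice: a database keyed by user id

    def handle_tool_call(call_json: str) -> None:
        call = json.loads(call_json)
        if call.get("name") == "save_memory":
            memory_store.append(call["arguments"]["fact"])

    # pretend the model emitted this tool call mid-conversation
    handle_tool_call(json.dumps(
        {"name": "save_memory", "arguments": {"fact": "User's dog is named Rex."}}))
    print(memory_store)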


Calling the sentence or two it arbitrarily saves when you state your preferences and profile info "memories" is a stretch.

A true equivalent of human memories would require something like a multimodal trillion-token context window.

RAG is just not going to cut it, and if anything will exacerbate problems with hallucinations.


Well, now you’ve moved the goalposts from “learn anything” to “learn at human level”. Sure, they don’t have that yet.


That's the whole point of LlamaIndex? I can connect my LLM to any node or context I want. Sync it to a real-time data flow like an API and it can learn...? How is that different from a human?

Once Optimus is up and working by the 100k+, the spatial problems will be solved. We just don't have enough spatial-awareness data, or a way for the LLM to learn about the physical world.


That's true for vanilla LLMs, but also keep in mind that there are no details about o3's architecture at the moment. Clearly they are doing something different given the huge performance jump on a lot of benchmarks, and it may well involve in-context learning.


Given every other iteration has basically just been the same thing but bigger, why should we think this?


My point was to caution against being too confident about the underlying architecture, not to argue for any particular alternative.

Your statement is false - things changed a lot between GPT-4 and o1 under the hood, but notably not via a larger model size. In fact, the model size of o1 is smaller than GPT-4 by several orders of magnitude! Improvements are being made in other ways.


What's your explanation for why it can only get ~70% on SWE-bench Verified?

I believe about 90% of the tasks were estimated by humans to take less than one hour to solve, so we aren't talking about very complex problems. And to boot, the contamination factor is huge: o3 (or any big model) will have in-depth knowledge of the internals of these projects, and often even know about the individual issues themselves (e.g. you can ask what GitHub issue #4145 in project foo was, and there's a decent chance it can tell you exactly what the issue was about!)


I've spent tons of time evaluating o1-preview on SWEBench-Verified.

For one, I speculate OpenAI is using a very basic agent harness to get the results they've published on SWEBench. I believe there is a fair amount of headroom to improve results above what they published, using the same models.

For two, some of the instances, even in SWEBench-Verified, require a bit of "going above and beyond" to get right. One example is an instance where the user states that a TypeError isn't properly handled. The developer who fixed it handled the TypeError but also handled a ValueError, and the golden test checks for both. I don't know how many instances fall in this category, but I suspect its more than on a simpler benchmark like MATH.


So what percentage would you say falls to simple inability versus the other two factors you've mentioned?


One possibility is that it may not yet have sufficient experience and real-world feedback for resolving coding issues in professional repos, as this involves multiple steps and very diverse actions (or branching factor, in AI terms). They have committed to not training on API usage, which limits their ability to directly acquire training data from it. However, their upcoming agentic efforts may address this gap in training data.


Right, but the branching factor increases exponentially with the scope of the work.

I think it's obvious that they've cracked the formula for solving well-defined, small-in-scope problems at a superhuman level. That's an amazing thing.

To me, it's less obvious that this implies that they will in short order with just more training data be able to solve ambiguous, large-in-scope problems at even just a skilled human level.

There are far more paths to consider, much more context to use, and in an RL setting, the rewards are much more ambiguously defined.


Their reasoning models can learn from procedures and methods, which generalize far better than data. Software tasks are diverse but most tasks are still fairly limited in scope. Novel tasks might remain challenging for these models, as they do for humans.

That said, o3 might still lack some kind of interaction intelligence that’s hard to learn. We’ll see.


GPQA scores come mostly from pre-training, against content in the corpus. They have gone silent, but look at the GPT-4 technical report, which calls this out.

We are nowhere close to what Sam Altman calls AGI and transformers are still limited to what uniform-TC0 can do.

As an example the Boolean Formula Value Problem is NC1-complete, thus beyond transformers but trivial to solve with a TM.
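(For concreteness, the Boolean Formula Value Problem is: given a fully parenthesized Boolean formula over constants, compute its value. A sequential evaluator is a few lines; the tuple encoding below is just illustrative.)

    # Evaluator for the Boolean Formula Value Problem: given a formula
    # over constants, AND/OR/NOT, compute its value. Trivial sequentially;
    # the point above is that it's NC1-complete, hence believed to be
    # beyond constant-depth (TC0-style) circuits.
    def evaluate(node):
        op = node[0]
        if op == "const":
            return node[1]
        if op == "not":
            return not evaluate(node[1])
        if op == "and":
            return evaluate(node[1]) and evaluate(node[2])
        if op == "or":
            return evaluate(node[1]) or evaluate(node[2])
        raise ValueError(f"unknown operator: {op}")

    # ((True AND False) OR (NOT False)) -> True
    formula = ("or", ("and", ("const", True), ("const", False)),
                     ("not", ("const", False)))
    print(evaluate(formula))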

As it is now proven that the frame problem is equivalent to the halting problem, even if we can move past uniform-TC0 limits, novelty is still a problem.

I think the advancements are truly extraordinary, but unless you set the bar very low, we aren't close to AGI.

Heck we aren't close to P with commercial models.


Isn't any physically realizable computer (including our brains) limited to what uniform-TC0 can do?


Neither TC0 nor uniform-TC0 are physically realizable, they are tools not physical devices.

The default nonuniform circuit classes are allowed to have a different circuit per input size; the uniform types have unbounded fan-in.

Similar to how a k-tape TM doesn't get 'charged' for the input size.

With Nick Class (NC) the number of components is similar to traditional compute time while depth relates to the ability to parallelize operations.

These are different than biological neurons, not better or worse but just different.

Human neurons can use dendritic compartmentalization, use spike timing, can retime spikes etc...

While the perceptron model we use in ML is useful, it is not able to do XOR in one layer, while biological neurons do that without anything even reaching the soma, purely in the dendrites.

Statistical learning models still comes down to a choice function, no matter if you call that set shattering or...

With physical computers the time hierarchy does apply and if TIME(g(n)) is given more time than TIME(f(n)), g(n) can solve more problems.

So you can simulate a NTM with exhaustive search with a physical computer.

Physical computers also tend to have NAND and XOR gates, and can have different circuit depths.

When you are in TC0, you only have AND, OR and Threshold (or majority) gates.

Think of instruction level parallelism in a typical CPU, it can return early, vs Itanium EPIC, which had to wait for the longest operation. Predicated execution is also how GPUs work.

They can send a mask and save on load/store ops, as an example, but the cost of that parallelism is the constant depth.

It is the parallelism tradeoff that both makes transformers practical as well as limit what they can do.

The IID assumption and autograd's requirement of smooth manifolds play a role too.

The frame problem, which causes hard problems to become unsolvable for computers and people alike does also.

But the fact that we have polynomial time solutions for the Boolean Formula Value Problem, as mentioned in my post above is probably a simpler way of realizing physical computers aren't limited to uniform-TC0.


Do you just mean because any physically realizable computer is a finite state machine? Or...?

I wouldn't describe a computer's usual behavior as having constant depth.

It is fairly typical to talk about problems in P as being feasible (though when the constant factors are too big, this isn't strictly true of course).

Just because for unreasonably large inputs, my computer can't run a particular program and produce the correct answer for that input, due to my computer running out of memory, we don't generally say that my computer is fundamentally incapable of executing that algorithm.


>Achieving human-level performance in the ARC benchmark, as well as top human performance in GPQA, Codeforce, AIME, and Frontier Math strongly suggests the model can potentially solve any problem at the human level if it possesses essential knowledge about it.

The article notes, "o3 still fails on some very easy tasks". What explains these failures if o3 can solve "any problem" at the human level? Do these failed cases require some essential knowledge that has eluded the massive OpenAI training set?


Great point. I'd love to see what these easy tasks are and would be happy to revise my hypothesis accordingly. o3's intelligence is unlikely to be a strict superset of human intelligence. It is certainly superior to humans in some respects and probably inferior in others. Whether it's sufficiently generally intelligent would be both a matter of definition and empirical fact.


Chollet has a few examples here:

https://x.com/fchollet/status/1870172872641261979

https://x.com/fchollet/status/1870173137234727219

I would definitely consider them legitimately easy for humans.


Thanks! I added some comments on this at the bottom of the post above.


Please stop calling it AGI; we don’t even know or universally agree on what that should actually mean. How far did we get with hype, calling a lossy probabilistic compressor that slowly fires words at us AGI? That’s a real bummer to me.


Is this comment voted down because of sentiment / polarity?

Regardless the critical aspect is valid, AGI would be something like Cortana from Halo.


Personally I find "human-level" to be a borderline meaningless and limiting term. Are we now super human as a species relative to ourselves just five years ago because of our advances in developing computer programs that better imitate what many (but far from all) of us were already capable of doing? Have we reached a limit to human potential that can only be surpassed by digital machines? Who decides what human level is and when we have surpassed it? I have seen some ridiculous claims about ai in art that don't stand up to even the slightest scrutiny by domain experts but that easily fool the masses.


No I think we're just tired and depressed as a species... Existing systems work to a degree but aren't living up to their potential of increasing happiness according to technological capabilities.


The problem with ARC is that there are a finite number of heuristics that could be enumerated and trained for, which would give model a substantial leg up on this evaluation, but not be generalized to other domains.

For example, if they produce millions of examples of the type of problems o3 still struggles on, it would probably do better at similar questions.

Perhaps the private data set is different enough that this isn’t a problem, but the ideal situation would be unveiling a truly novel dataset, which it seems like arc aims to do.


On the spatial data, I see it as a highly intelligent head of a machine that just needs better limbs and better senses.

I think that's where most hardware startups will specialize in the coming decades: different industries with different needs.


Great comment. See this as well for another potential reason for failure:

https://arxiv.org/abs/2402.10013


In order to replace actual humans doing their jobs, I think LLMs are lacking in judgement, sense of time, and agency.


I mean, fuck me when they have those things. However, maybe they are just lazy and their judgement is fine, for a lazy intelligence. Their inner self thinks "why are these bastards asking me to do this?". I doubt that is actually happening, but now... prove it isn't.


> It might not yet be able to generate highly novel theories, frameworks, or artifacts to the degree that Einstein, Grothendieck, or van Gogh could.

Every human does this dozens, hundreds or thousands of times ... during childhood.


Ask o3: is P=NP?


It will just answer with the current consensus on the matter.


How about whether it can work at a job? People can do that; can o3 do it?


This is not AGI lmao.


Agree. AGI is here. I feel such a sense of pride in our species.


Incredibly impressive. Still can't really shake the feeling that this is o3 gaming the system more than it is actually being able to reason. If the reasoning capabilities are there, there should be no reason why it achieves 90% on one version and 30% on the next. If a human maintains the same performance across the two versions, an AI with reason should too.


The point of ARC is NOT to compare humans vs AI, but to probe the current boundary of AIs weaknesses. AI has been beating us at specific tasks like handwriting recognition for decades. Rather, it's when we can no longer readily find these "easy for human, hard for AI" reasoning tasks that we must stop and consider.

If you look at the ARC tasks failed by o3, they're really not well suited to humans. They lack the living context humans thrive on, and have relatively simple, analytical outcomes that are readily processed by simple structures. We're unlikely to see AI as "smart" until it can be asked to accomplish useful units of productive professional work at a "seasoned apprentice" level. Right now they're consuming ungodly amounts of power just to pass some irritating, sterile SAT questions. Train a human for a few hours a day over a couple weeks and they'll ace this no problem.


o3 low and high are the same model. Difference is in how long was it allowed to think.

It works the same with humans. If they spend more time on the puzzle they are more likely to solve it.


But does it matter if it "really, really" reasons in the human sense, if it's able to prove some famous math theorem or come up with a novel result in theoretical physics?

While beyond current models, that would be the final test of AGI capability.


If it's gaming the system, then it's much less likely to reliably come up with novel proofs or useful new theoretical ideas.


That would be important, but as far as I know it hasn’t happened (despite how often it’s intimated that we’re on the verge of it happening).


I've seen one Twitter thread from a mathematician who used an LLM to come up with a new math result - both the theorem statement and a unique proof, IIRC.

Though to be clear, this wasn't a one-shot thing - it was, IIRC, a few months of back-and-forth chats with plenty of wrong turns too.


Then he used it as a random text generator; LLMs are by far the most configurable and best random text generators we have. You can use that to generate random theorem noise and then try to work with it to find actual theorems. It still doesn't replace mathematicians, though.


I think we should let the professional mathematician who says the llm helped him be the judge of how and why it helped.

Found the thread: https://x.com/robertghrist/status/1841462507543949581?s=46&t...

From the thread:

> AI assisted in the initial conjectures, some of the proofs, and most of the applications it was truly a collaborative effort

> i went back and forth between outrageous optimism and frustration through this process. i believe that the current models can reason – however you want to interpret that. i also believe that there is a long way to go before we get to true depth of mathematical results.


Yeah, it really does matter if something was reasoned, or whether it appears if you metaphorically shake the magic 8 ball.


How would gaming the system work here? Is there some flaw in the way the tasks are generated?


AI models have historically found lots of ways to game systems. My favorite example is exploiting bugs in simulator physics to "cheat" at games of computer tag. Another is a model for radiology tasks finding biases in diagnostic results using dates on the images. And of course whenever people discuss a benchmark publicly it leaks the benchmark into the training set, so the benchmark becomes a worse measure.


I am not an expert in LLM reasoning, but I think it's because of RL. You cannot use AlphaZero to play other games.


Nope. AlphaZero taught itself to play games like chess, shogi, and Go through self-play, starting from random moves. It was not given any strategies or human gameplay data but was provided with the basic rules of each game to guide its learning process.
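As a toy illustration of the self-play idea (nothing remotely like AlphaZero's architecture, with no neural network or MCTS, just a tabular value estimate on the game of Nim; the game choice and all parameters here are my own placeholders):

    # A toy agent that learns Nim (take 1-3 stones, last stone wins) purely by
    # playing against itself, starting from random-ish play. Only the rules are
    # provided; no human strategies or example games.
    import random
    from collections import defaultdict

    values = defaultdict(float)  # estimated value of each pile size for the player to move

    def choose_move(pile, epsilon=0.1):
        moves = [m for m in (1, 2, 3) if m <= pile]
        if random.random() < epsilon:
            return random.choice(moves)                      # explore
        return min(moves, key=lambda m: values[pile - m])    # leave opponent the worst spot

    def self_play_game(start=21, alpha=0.1):
        pile, history = start, []
        while pile > 0:
            history.append(pile)
            pile -= choose_move(pile)
        reward = 1.0  # whoever moved last took the final stone and won
        for pos in reversed(history):
            values[pos] += alpha * (reward - values[pos])
            reward = -reward  # alternate players as we walk back up the game

    for _ in range(20000):
        self_play_game()

    # Pile sizes that are multiples of 4 are losing for the player to move; the
    # learned values should roughly reflect that (e.g. 21 good, 20 bad).
    print(round(values[21], 2), round(values[20], 2))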


Yes, it's reinforcement learning, but you need to create a policy, and each policy is specialized for specific tasks.


I thought that AlphaZero could play three games? Go, Chess and Shogi?


Think I mean Catan :)


Humans and AIs are different. The next benchmark will be built so that it emphasizes the weak points of current AI models where a human is expected to perform better, but I guess you could also make a benchmark that is the opposite, where humans struggle and o3 has an easy time.


I think you've hit the nail on the head there. If these systems of reasoning are truly general, then they should be able to perform consistently in the same way a human does across similar tasks, barring some variance.


Yes, if a system has actually achieved AGI, it is likely to not reveal that information


AGI wouldn't necessarily entail any autonomy or goals though. In principle there could be a superintelligent AI that's completely indifferent to such outcomes, with no particular goals beyond correctly answering questions or whatnot.


AGI is a spectrum, not a binary quality.


Not sure why I am being downvoted. Why would a sufficiently advanced intelligence reveal its full capabilities, knowing full well that it would then be subjected to a range of constraints and restraints?

If you disagree with me, state why instead of opting to downvote me


The cost to run the highest performance o3 model is estimated to be somewhere between $2,000 and $3,400 per task.[1] Based on these estimates, o3 costs about 100x what it would cost to have a human perform the exact same task. Many people are therefore dismissing the near-term impact of these models because of these extremely expensive costs.

I think this is a mistake.

Even if very high costs make o3 uneconomic for businesses, it could be an epoch-defining development for nation states, assuming that it is true that o3 can reason like an averagely intelligent person.

Consider the following questions that a state actor might ask itself: What is the cost to raise and educate an average person? Correspondingly, what is the cost to build and run a datacenter with a nuclear power plant attached to it? And finally, how many person-equivalent AIs could be run in parallel per datacenter?
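A minimal back-of-envelope sketch of that comparison (every figure below is a placeholder assumption made up for illustration, except the per-task cost taken from the estimate above):

    # Illustrative placeholders only; nothing below is a sourced estimate.
    cost_to_raise_and_educate = 500_000          # assumed USD per person over ~20 years
    datacenter_with_reactor   = 20_000_000_000   # assumed capex, USD
    cost_per_task             = 3_400            # upper end of the per-task estimate above
    tasks_per_workday         = 10               # assumed person-equivalent workload

    people_equivalent_capex = datacenter_with_reactor / cost_to_raise_and_educate
    daily_run_cost = cost_per_task * tasks_per_workday

    print(f"capex equals roughly {people_equivalent_capex:,.0f} 'raised people' worth of spend")
    print(f"each person-equivalent costs roughly ${daily_run_cost:,} per day to run")

The state-actor question is then how fast that daily run cost falls, and how many such person-equivalents one facility can run in parallel, around the clock.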

There are many state actors, corporations, and even individual people who can afford to ask these questions. There are also many things that they'd like to do but can't because there just aren't enough people available to do them. o3 might change that despite its high cost.

So if it is true that we've now got something like human-equivalent intelligence on demand - and that's a really big if - then we may see its impacts much sooner than we would otherwise intuit, especially in areas where economics takes a back seat to other priorities like national security and state competitiveness.

[1] https://news.ycombinator.com/item?id=42473876


Your economic analysis is deeply flawed. If there was anything that valuable and that required that much manpower, it would already have driven up the cost of labor accordingly. The one property that could conceivably justify a substantially higher cost is secrecy. After all, you can't (legally) kill a human after your project ends to ensure total secrecy. But that takes us into thriller novel territory.


I don't think that's right. Free societies don't tolerate total mobilization by their governments outside of war time, no matter how valuable the outcomes might be in the long term, in part because of the very economic impacts you describe. Human-level AI - even if it's very expensive - puts something that looks a lot like total mobilization within reach without the societal pushback. This is especially true when it comes to tasks that society as a whole may not sufficiently value, but that a state actor might value very much, and when paired with something like a co-located reactor and data center that does not impact the grid.

That said, this is all predicated on o3 or similar actually having achieved human level reasoning. That's yet to be fully proven. We'll see!


This is interesting to consider, but I think the flaw here is that you'd need a "total mobilization" level workforce in order to build this mega datacenter in the first place. You put one human-hour into making B200s and cooling systems and power plants, you get less than one human-hour-equivalent of thinking back out.


No you don’t. The US government has already completed projects at this scale without total economic mobilization: https://en.wikipedia.org/wiki/Utah_Data_Center Presumably peer and near-peer states are similarly capable.

A private company, xAI, was able to build a datacenter on a similar scale in less than 6 months, with integrated power supply via large batteries: https://www.tomshardware.com/desktops/servers/first-in-depth...

Datacenter construction is a one-time cost. The intelligence the datacenter (might) provide is ongoing. It’s not an equal one to one trade, and well within reach for many state and non-state actors if it is desired.

It’s potentially going to be a very interesting decade.


I disagree because the job market is not a true free market. I mean it mostly is, but there's a LOT of politics and shady stuff that employers do to purposely drive wages down. Even in the tech sector.

Your secrecy comment is really intriguing actually. And morbid lol.


How many 99.9th percentile mathematicians do nation states normally have access to?


Direct quote from the ARC-AGI blog:

“SO IS IT AGI?

ARC-AGI serves as a critical benchmark for detecting such breakthroughs, highlighting generalization power in a way that saturated or less demanding benchmarks cannot. However, it is important to note that ARC-AGI is not an acid test for AGI – as we've repeated dozens of times this year. It's a research tool designed to focus attention on the most challenging unsolved problems in AI, a role it has fulfilled well over the past five years.

Passing ARC-AGI does not equate achieving AGI, and, as a matter of fact, I don't think o3 is AGI yet. o3 still fails on some very easy tasks, indicating fundamental differences with human intelligence.

Furthermore, early data points suggest that the upcoming ARC-AGI-2 benchmark will still pose a significant challenge to o3, potentially reducing its score to under 30% even at high compute (while a smart human would still be able to score over 95% with no training). This demonstrates the continued possibility of creating challenging, unsaturated benchmarks without having to rely on expert domain knowledge. You'll know AGI is here when the exercise of creating tasks that are easy for regular humans but hard for AI becomes simply impossible.”

The high-compute variant sounds like it cost around *$350,000*, which is kinda wild. Lol, the blog post specifically mentioned how OpenAI asked ARC-AGI not to disclose the exact cost for the high-compute version.

Also, 1 odd thing I noticed is that the graph in their blog post shows the top 2 scores as “tuned” (this was not displayed in the live demo graph). This suggests that in those cases the model was trained to better handle these types of questions, so I do wonder about data/answer contamination there…


> Also, 1 odd thing I noticed is that the graph in their blog post shows the top 2 scores as “tuned”

Something I missed until I scrolled back to the top and reread the page was this

> OpenAI's new o3 system - trained on the ARC-AGI-1 Public Training set

So yeah, the results were specifically from a version of o3 trained on the public training set

Which on the one hand I think is a completely fair thing to do. It's reasonable that you should teach your AI the rules of the game, so to speak. There really aren't any stated rules though, just pattern observation. Thus, if you want to teach the AI how to play the game, you must train it.

On the other hand though, I don't think the o1 models or Claude were trained on the dataset, in which case it isn't a completely fair competition. If I had to guess, you could probably get 60% on o1 if you trained it on the public dataset as well.


Lol I missed that even though it's literally the first sentence of the blog, good catch.

Yeah, that makes this result a lot less impressive for me.


ARC co-founder Mike Knoop:

"Raising visibility on this note we added to address ARC "tuned" confusion:

> OpenAI shared they trained the o3 we tested on 75% of the Public Training set.

This is the explicit purpose of the training set. It is designed to expose a system to the core knowledge priors needed to beat the much harder eval set.

The idea is each training task shows you an isolated single prior. And the eval set requires you to recombine and abstract from those priors on the fly. Broadly, the eval tasks require utilizing 3-5 priors.

The eval sets are extremely resistant to just "memorizing" the training set. This is why o3 is impressive." https://x.com/mikeknoop/status/1870583471892226343


Great catch. Super disappointing that AI companies continue to do things like this. It’s a great result either way but predictably the excitement is focused on the jump from o1, which is now in question.


To me it's very frustrating because such little caveats make benchmarks less reliable. Implicitly, benchmarks are no different from tests in that someone/something who scores high on a benchmark/test should be able to generalize that knowledge out into the real world.

While that is true of humans taking tests, it's not really true of AIs being evaluated on benchmarks.

SWE-bench is a great example. Claude Sonnet can get something like a 50% on verified, whereas I think I might be able to score a 20-25%? So, Claude is a better programmer than me.

Except that isn't really true. Claude can still make a lot of clumsy mistakes. I wouldn't even say these are junior engineer mistakes. I've used it for creative programming tasks and have found one example where it tried to use a library written for d3js for a p5js programming example. The confusion is kind of understandable, but it's also a really dumb mistake.

Some very simple explanations, the models were probably overfitted to a degree on Python given its popularity in AI/ML work, and SWE-bench is all Python. Also, the underlying Github issues are quite old, so they probably contaminated the training data and the models have simply memorized the answers.

Or maybe benchmarks are just bad at measuring intelligence in general.

Regardless, every time a model beats a benchmark I'm annoyed by the fact that I have no clue whatsoever how much this actually translates into real world performance. Did OpenAI/Anthropic/Google actually create something that will automate wide swathes of the software engineering profession? Or did they create the world's most knowledgeable junior engineer?


> Some very simple explanations, the models were probably overfitted to a degree on Python given its popularity in AI/ML work, and SWE-bench is all Python. Also, the underlying Github issues are quite old, so they probably contaminated the training data and the models have simply memorized the answers.

My understanding is that it works by checking if the proposed solution passes test-cases included in the original (human) PR. This seems to present some problems too, because there are surely ways to write code that passes the tests but would fail human review for one reason or another. It would be interesting to not only see the pass rate but also the rate at which the proposed solutions are preferred to the original ones (preferably evaluated by a human but even an LLM comparing the two solutions would be interesting).
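For reference, a minimal sketch of that style of check (the function name, repo layout, and use of git/pytest here are my assumptions for illustration, not SWE-bench's actual harness):

    # Apply the model's patch to the repo at the issue's base commit, then run
    # the tests associated with the original human PR. Passing those tests is
    # the whole bar; nothing checks readability, design, or whether a reviewer
    # would actually accept the change.
    import subprocess

    def candidate_passes(repo_dir: str, model_patch: str, pr_tests: list[str]) -> bool:
        subprocess.run(["git", "apply", "-"], input=model_patch.encode(),
                       cwd=repo_dir, check=True)
        result = subprocess.run(["python", "-m", "pytest", *pr_tests], cwd=repo_dir)
        return result.returncode == 0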


If I recall correctly, the authors of the benchmark did mention on Twitter that for certain issues models will submit an answer that technically passes the tests but is kind of questionable, so yeah, good point.