comeonbro's comments

This article is about things which aren't limitations anymore!

You are applauding it as pushback for pushback's sake, but it's an article about limitations in biplane construction, published after we'd already landed on the moon.


Is there any evidence that these fundamental issues with compositionality have been resolved or are you just asserting it? Has the paper been replicated with a CoT model and had a positive result?


Well, yes — because modern models can solve all the examples in the article. The theory of compositionality is still an issue, but the evidence for it recedes.

I think most of the issue comes from the challenge of informational coherence. Once incoherence enters the context, the intelligence drops massively. You can have a lot of context and LLMs can maintain coherence, but not if the context itself is incoherent.

And, informationally, it is just a matter of time before a little incoherence gets into a thread.

This is why agents have so much potential: being able to split distinct threads of thought into separate context windows reduces the likelihood of incoherence emerging (vs. one long thread).

Actually, maybe “cybernetic ecologies” are closer to what I mean than “agents.” See Anthropic’s “Building Effective Agents.” https://www.anthropic.com/research/building-effective-agents
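
To make the idea concrete, here is a minimal sketch; llm() is a hypothetical stand-in for any chat-completion call, not a real API:

    from typing import List, Dict

    def llm(messages: List[Dict[str, str]]) -> str:
        """Hypothetical chat-completion call; stands in for any provider SDK."""
        raise NotImplementedError

    def run_in_fresh_context(task: str) -> str:
        # A new message list is a new context window, so incoherence in one
        # subtask's thread cannot leak into the others.
        return llm([{"role": "user", "content": task}])

    def orchestrate(subtasks: List[str]) -> str:
        results = [run_in_fresh_context(t) for t in subtasks]
        # Only the distilled results, not the messy intermediate threads,
        # get merged into the final context.
        return llm([{"role": "user",
                     "content": "Combine these results:\n" + "\n".join(results)}])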


>I think most of the issue comes from the challenge of informational coherence. Once incoherence enters the context, the intelligence drops massively. You can have a lot of context and LLMs can maintain coherence, but not if the context itself is incoherent.

As a non-expert, part of my definition of intelligence is that the system can detect incoherence, a.k.a. reject bullshit. LLMs today can't do that and will happily emit bullshit in response.

Maybe the "gates" in the "workflows" discussed in the Anthropic article are a practical solution to that. But that still just seems like inserting human intelligence into the system for a specific engineering domain, not a general solution.
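
For what it's worth, a gate can be as simple as a retry loop around a programmatic check; a rough sketch, where generate and check_coherent are placeholders for whatever a given domain actually allows:

    def gated_step(generate, check_coherent, max_retries: int = 3) -> str:
        # Re-sample until the output passes a domain-specific check
        # (schema validation, unit tests, citation lookup, ...).
        for _ in range(max_retries):
            draft = generate()
            if check_coherent(draft):
                return draft
        raise ValueError("no draft passed the gate; escalate to a human")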


> Chatbot Software Begins to Face Fundamental Limitations

> Recent results show that large language models struggle with compositional tasks, suggesting a hard limit to their abilities.

Your first question with anything like this should always be WHICH MODELS:

> For our experiments, we evaluate the performance of 6 LLMs: GPT4 (gpt-4) [58], ChatGPT (GPT3.5-turbo) [57], GPT3 (text-davinci-003) [11], FlanT5 [17] and LLaMa [75].

This is ancient. This research was done centuries ago. This is research about the possibility of isotopes, written about radium in 1903, published in 1946. It is a criminal level of journalistic malpractice to leave uninformed readers with the impression that this is where AI stands today.


I'm sure this is multifactorial and the other factors are probably as or more important. But especially if you are familiar with the very egregious specifics of the FAA diversity hiring scandal (https://www.tracingwoodgrains.com/p/the-faas-hiring-scandal-...), "is unsupportable" here is a strong claim, playing the classic and well-worn role that "there's no evidence" often does in bad journalism: https://www.astralcodexten.com/p/the-phrase-no-evidence-is-a...


I dug into one of the claims in the blog, and as far as I can tell, he's completely misinterpreting the evidence. The blog says:

> An FAA employee and then-president of the NBCFAE's Washington Suburban chapter, provided NBCFAE members with "buzz words" in January 2014 that would automatically push their resumes to the tops of HR files.

It's true that this person said that in the email, but if you actually look at the list of buzzwords, it's clear that this person was bullshitting and inflating his own importance (or maybe just fundamentally misunderstanding something). The list is on the last page of this document: https://drive.google.com/drive/folders/17Vi9dDtZvbwHDafrygRG... It looks like a page photocopied from a book about how to write a resume. It's a list of dozens of incredibly generic verbs like "manage", "analyze", "administer", "make", "improve", "design", etc. Pretty much any resume will have at least some of these verbs. There's just no way you could build a system that would use these verbs to secretly screen in resumes of people who are in the know, because everyone uses these verbs. A far more plausible explanation is that this guy was trying to make himself sound more powerful than he really was.
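
To see why, here is a toy version of such a filter (the resume snippets are hypothetical): keying on ubiquitous verbs means it matches essentially everyone, so it can't secretly single anyone out:

    BUZZWORDS = {"manage", "analyze", "administer", "make", "improve", "design"}

    def passes_filter(resume_text: str) -> bool:
        # "Screens in" any resume containing at least one buzzword.
        words = set(resume_text.lower().replace(",", " ").split())
        return bool(words & BUZZWORDS)

    resumes = [
        "Helped design and improve internal tooling",
        "Manage vendor contracts and administer payroll",
        "Analyze quarterly sales data, make weekly reports",
    ]
    print([passes_filter(r) for r in resumes])  # [True, True, True]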

The email this person sent is awful, but it seems to be one person's very misguided attempt to push an agenda, not some sort of secret plot by the FAA.


You are conflating multiple different instances of current FAA employees providing outside racial identity groups with both hard and soft answer keys; the buzzword filter was just the first stage.

Later and probably most egregiously, there was a completely nonsensical and arbitrary biographical questionnaire which was scored like:

https://storage.courtlistener.com/recap/gov.uscourts.dcd.182...

    15. The high school subject in which I received my lowest grades was:
    A. SCIENCE (+15)
    B. MATH (0)
    C. ENGLISH (0)
    D. HISTORY/SOCIAL SCIENCES (0)
    E. PHYSICAL EDUCATION (0)

    16. Of the following, the college subject in which I received my lowest grades was:
    A. SCIENCE (0)
    B. MATH (0)
    C. ENGLISH (0)
    D. HISTORY/POLITICAL SCIENCE (+15)
    E. DID NOT ATTEND COLLEGE (0)

    29. My peers would probably say that having someone criticize my performance (i.e. point out a mistake) bothers me:
    A. MUCH LESS THAN MOST (+8)
    B. SOMEWHAT LESS THAN MOST (+4)
    C. ABOUT THE SAME AS MOST (+8)
    D. SOMEWHAT MORE THAN MOST (0)
    E. MUCH MORE THAN MOST (+10)
Current FAA employees, again, distributed the exact answer key to this questionnaire to outside racial identity groups to give to their members.


If the buzzwords that Snow provided turned out to just be something he pulled from a resume book rather than insider information, why should we believe that the answer keys were any more accurate?


Look at the exam and scoring rubric: https://kaisoapbox.com/projects/faa_biographical_assessment/

Look at questions 29 and 33. The first (about whether negative feedback bothers you) is a plausible question, but the grading is completely nonsensical. The second, about art/dance classes you took in college, is nonsensical in both the question and the answers. These seem obviously designed to be gamed with secret information.

This was used as a mandatory screen for several years. The FAA didn't fix it; Congress found out and banned it. How many people at the FAA saw this and green-lit it?


I found the question and answer weights as well as some primary source documents on the methodology in the court file.

My interpretation is that this was not "obviously designed to be gamed with secret information", it was just bad methodology. They had a goal of screening out 70% of applicants, but the remaining 30% of applicants needed to have a demographic balance that would not constitute a disparate impact, and they had to do this in a legally defensible way. Working backwards from that goal, they took biographical data that they had collected in the 80s and 90s and constructed a test program that, they believed, would give them the result they wanted. So if the answer weights are logically nonsensical, it's not because they were building in a secret password, it's because that's what happens when you build a model that's overfit on a small number of datapoints.

https://storage.courtlistener.com/recap/gov.uscourts.dcd.182...

https://www.courtlistener.com/docket/4542755/139/24/brigida-...
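
A toy illustration of that failure mode (purely synthetic data, not the FAA's): fit per-answer weights against a target outcome using far fewer applicants than free parameters, and the resulting "rubric" comes out arbitrary in exactly this way:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n_applicants, n_questions = 40, 30
    answers = rng.integers(0, 5, size=(n_applicants, n_questions))  # choices A-E as 0-4
    passed = rng.integers(0, 2, size=n_applicants)  # the outcome the modelers worked back from

    # One-hot encode so every (question, choice) pair gets its own weight,
    # mirroring a rubric that assigns points per answer option.
    onehot = np.zeros((n_applicants, n_questions * 5))
    for q in range(n_questions):
        onehot[np.arange(n_applicants), q * 5 + answers[:, q]] = 1

    weights = LogisticRegression(max_iter=10_000).fit(onehot, passed).coef_
    # With 40 samples and 150 free weights, the per-option weights come out
    # arbitrary: positive here, negative there, with no coherent rationale.
    print(weights.reshape(n_questions, 5).round(2))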


Wow. I thought I remembered something like that, but it sounded too extreme and wouldn't be easily supportable, so I left it out of my other comment. So this test:

1) was designed to statistically select for members of favored identity groups and against members of disfavored identity groups, not in any way to measure ATC job aptitude, resulting in highly-scored questions like "The high school subject in which I received my lowest grades was", where the only correct answer was "Science"; and failing the test disqualified you permanently

and

2) current FAA employees distributed the exact answer key to outside racial identity organizations to give to their members


What were they fitting for?


What is your source for how the biographical assessment was graded? If you read the tracingwoodgrains blog post, he is saying that Shelton Snow, the guy from the association of black employees, left voicemails telling people what the correct answers to the biographical assessment questions were. But if you read my comment above, this person seems to have been pretending to have insider knowledge and access: the secret buzzwords he sent around seem to just be a photocopy from a generic resume-writing book. I don't think anything that Snow said about the application process can be taken at face value.

If you have another source for how the assessment was actually graded, I'd love to see it, but as far as I can tell, these claims are coming from a guy who seemed to just be making stuff up.


The scoring rubric is in the litigation documents (in the Google drive linked in the website). ECF 139-26 has one copy.


You already know.

https://generalstrikeus.com/aboutus

> What are your demands?

> Our broad list of demands includes, but is not limited to: Climate action. Universal healthcare. Racial justice. Reproductive rights. LGBTQIA+ rights. Living wage / raise the minimum wage. Immigration reform. Education reform. Gun safety. Tax the rich. Affordable housing. Disability rights. Welfare and child support reform. Voters rights. Constitutional convention. Paid family and medical leave. Criminal justice system reform. Workers’ rights. Permanent ceasefire in Gaza.

> Specific demands will come from leaders and experts of existing fights for racial, economic, gender and environmental justice.


I don't want to disparage any of these specific demands, and I agree with a lot (not all) of them, but put together they're so overly broad as to be a pipe dream. They might as well say they want to start a new political party that's slightly to the left of the Democrats.

The one thing that does stick out to me, though, is the Constitutional Convention. This is a bizarre ask, and it must be how they see their demands being passed not only into law but into amendments to the Constitution. The problem is that calling a convention is so difficult† that their strike will last, essentially, forever. The last and only time we had a convention was in 1787.

† This graphic illustrates the difficulty: https://en.wikipedia.org/wiki/Convention_to_propose_amendmen...


Well, guns are already safe as long as you follow the rules.

https://www.nssf.org/articles/4-primary-rules-of-firearm-saf...

And there has been a permanent ceasefire in Gaza since 01/19, which will last until someone decides to start firing again.

https://en.wikipedia.org/wiki/2025_Gaza_war_ceasefire

So they're already 2 for 19 on their demands. Impressive progress, and they didn't even have to actually strike!


When you see racial/women's/LGBTQ rights and Gaza in the same sentence you know how out of touch these organizers are.


Yep, I know exactly what kind of person wrote this.


With such a wide set of goals, this won't go far. They are limiting themselves to people that agree with all of that.


"> Our broad list of demands includes, but is not limited to: Climate action. Universal healthcare. Racial justice. Reproductive rights. LGBTQIA+ rights. Living wage / raise the minimum wage. Immigration reform. Education reform. Gun safety. Tax the rich. Affordable housing. Disability rights. Welfare and child support reform. Voters rights. Constitutional convention. Paid family and medical leave. Criminal justice system reform. Workers’ rights. Permanent ceasefire in Gaza."

Lol.

"Hey, Johnny, what are you striking against?"

"What've you got?"


Is it just "We refuse to work until the country conforms to our left-wing ideologies"?


Insane cope. Emily Bender and Gary Marcus are still trying to push "stochastic parrot" the day after o1 caused one of the last remaining credible LLM reasoning skeptics (Chollet) to admit defeat.


Push what?


Anti-AI grift, FUD, and more or less awful takes, cashing in on their credentials to the detriment of their respective institutions.


Yeah they are definitely making a lot of money doing this compared to being on the other side.


AI will account for 99% of global energy consumption. Solve for equilibrium.



o1 can already do your job. Have you tried it?


I tried it.

o1 cannot do my job, and hallucinated immediately.


So you are using the old stand-by argument: "my job is safe, so if society collapses around me, I'm still all good, pass the beer".

It can't do my job, so it can't do any jobs?


Everyone thinks AI can do somebody else’s job. I dunno, maybe they are right, but it seems like there might be some bias. Can AI do your job?


This is the fundamental reality I keep seeing.

The people who think an LLM or other type of model can do a job are the people who don't know anything about that job.

Managers who don't know what developers do think you can replace a developer with an LLM. Can an LLM shit out some code? Sure, but that's hardly what (good) developers do.

Magazine publishers who don't know what editors do think an LLM can replace an editor. Can an LLM make a bunch of statements about the quality of a piece of writing? Sure, but they may have no basis in reality and will require a real human editor to review them. Or your publication can succumb to being LLM-generated slop.

Bad coders who don't know what good coders do see that an LLM can do what they've been doing and think developers will be replaced, but they don't actually realize what it means to be a developer, so they don't see all the things the LLM isn't doing.

Tech bros think a model will be able to revolutionize materials development, but when actual materials scientists look at the output, it turns out it's mostly garbage. And crucially, it took actual materials scientists spending a whole lot of time to figure that out. [0]

Most of what these models do is waste actual experts' time by forcing them to wade through huge quantities of plausible looking but completely incorrect output.

0: https://pubs.acs.org/doi/10.1021/acs.chemmater.4c00643


The main thing I see (and it is possible I'm biased by some of these same factors) is that it does seem to make for some moderately helpful tools, like coding assistants to knock out boilerplate.

Maybe if it could make a developer twice as effective, it could halve the developer-to-project ratio. Jevons paradox, and we get twice as many projects, great. But the management requirements would be different, right? If teams are half as large, you wouldn't expect management to just, like, go away. But the tree might be able to lose some middle "summarize and pass up" levels, right?


Basically it helps with really generic stuff, but writing generic stuff isn't my job. I have access to Windows Copilot Pro (basically ChatGPT 4 Turbo / 4o, depending on the time of day) as a tester, and GitHub Copilot as a user. I can say without hesitation that ChatGPT is slightly better at writing comments, but both are bad at doing anything harder than writing HTML forms, at least from scratch.

They _are_ a great rubber duck though. Today I had a concurrency issue and asked Windows Copilot for solutions. It was wrong, but it gave me an idea, basically saving me at least 45 minutes. GitHub Copilot is a great autocomplete, saving me some time too, but I don't think it can make me twice as effective as a coder. 20%? Maybe 40% if I take into account the fact that it generates really good test cases (that I still have to read)?

But coding is like 25 percent of my job; database/object design and software architecture are like 30%, network security another 25%, and the rest is meetings/coordination. So all in all, I don't think you can halve teams just because you give them good genAI.
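
Running the numbers: even granting the optimistic 40% coding speedup, Amdahl's law over those rough percentages says the overall gain is small:

    def overall_speedup(fraction: float, speedup: float) -> float:
        # Amdahl's law: only `fraction` of the work gets `speedup` times faster.
        return 1 / ((1 - fraction) + fraction / speedup)

    coding_share = 0.25  # coding is ~25% of the job, per the estimate above
    assist_gain = 1.4    # the optimistic 40%-faster-coding case

    print(f"{overall_speedup(coding_share, assist_gain):.2f}x")  # ~1.08x overall

About an 8% overall speedup, nowhere near enough to halve a team.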


Every code generator I've used makes me think developers' jobs are plenty safe from an AI takeover at the moment. You can't seem to spend your way into producing good data. Nvidia is certainly enjoying watching people try, though.

What's sad is that the tech is actually impressive and fascinating, but it's being forced to look more useful than it is by greedy investors. But what else is new; water is wet and all that.


Suppose AI can do anyone's job. Then, after massive layoffs, no one would receive a paycheck. How would that produce a burgeoning economy? Or indeed, any economy.


If we don’t need any services, we don’t need any people. And if we don’t need any people, we don’t need any services. Technically, “let’s all die” has always been a solution that balances the equations of the economy; actually it is the simplest one: 0=0. But, we’ve muddled along somehow.


It can if you have someone knowledgeable driving it. Otherwise it gives out wrong or outdated information often enough that it would quickly cause bugs in production code. And that's the thing: it can do narrow scopes of a job well, but ask it to do the entire task and it will often mess up somewhere, slightly enough to cause a snowball effect of failures by the time it's "done".


It can take responsibility for production code and get fired when it produces sub-standard work?


We need to invent AI punishments to get it to be responsible. /s

Accountability requires that you have a body.


If your job is copy-pasting Stack Overflow responses, sure. If you work on something specific, need to talk to clients, need to brainstorm ideas, it's still very meh at best.

ChatGPT has been here for what, 24 months? The unemployment rate is the same as ever, maybe a bit on the lower side if anything. If it was the miracle they promised, we would see it everywhere: GDP, unemployment rate, productivity, &c.


> Nobody thinks GPT4-o1 or Sonnet 3.5 is going to change the world.

I do. AI progress could magically hit a brick wall right now and never advance any further, and o1 would still change the world more drastically than you can imagine. But AI progress will also not magically hit a brick wall.

I suppose for that reason o1 specifically will not change the world, because it will be superseded too soon to. But it would if it weren't.


Agreed. o1 is a step in the right direction. There are half a dozen improvements that could be made along the same lines without introducing any new technology.


Canadian "all-purpose" flour has a higher gluten content than American "all-purpose" flour. It's closer to American "bread flour".

That's aside from whatever harder-to-measure taste differences it might have, but it makes a big difference by itself.
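
If you want to approximate Canadian all-purpose with American flours, a rough blend calculation; the protein figures are ballpark assumptions (US all-purpose ~11%, US bread flour ~14%, Canadian all-purpose ~13%):

    target, ap, bread = 13.0, 11.0, 14.0   # approximate protein percentages
    bread_fraction = (target - ap) / (bread - ap)  # simple linear mixing by mass
    print(f"~{bread_fraction:.0%} bread flour, rest all-purpose")  # ~67% bread flour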

