Hacker News | thesz's comments

Alphabet Inc., as YouTube's owner, faces a class action lawsuit [1] which alleges that the platform enables bad behavior and promotes behavior leading to mental health problems.

[1] https://www.motleyrice.com/social-media-lawsuits/youtube

In my not so humble opinion, what AI companies enable (and what this particular bot demonstrated) is bad behavior that can lead to mental health problems for software maintainers, particularly because of the sheer amount of work needed to read excessively lengthy documentation and review the often huge amounts of generated code. Never mind the attempted smear we discuss here.


So I went to check whether LLM addiction is a thing, because that was the pole around which the grandparent's comment revolved.

It appears that LLM addiction is real and it is in the same room as we are: https://www.mdpi.com/1999-4893/18/12/789

I would like to add that sugar consumption is a risk factor for many dependencies, including, but not limited to, opioids [1]. And LLM addiction can be seen as a fallout of sugar overconsumption in general.

[1] https://news.uoguelph.ca/2017/10/sugar-in-the-diet-may-incre...

Yet, LLM addiction is being investigated in medical circles.


I definitely don't deny that LLM addiction exists, but attempting to paint literally everyone that uses LLMs and thinks they are useful, interesting, or effective as addicted or falling for confidence or persuasion tricks is what I take issue with.

Did he do so? I read his comment as a sad take on the situation when one realizes that one is talking to a machine instead of (directly) to another person.

In my opinion, participating in a discussion through an LLM is a sign of excessive LLM use. Which can be a sign of LLM addiction.


Interesting how you've painted everyone who uses LLMs and LLM addicts with the same brush to shore up your argument.

Fortran is all about symbolic programming. There is no probability in the internal workings of a Fortran compiler. Almost anyone can learn its rules and count on them.

LLMs are all about probabilistic programming. While they are harnessed by a lot of symbolic processing (tokens, as a simple example), the core is probabilistic. No hard rules can be learned.
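That probabilistic core can be sketched in a few lines. This is an illustration only: the logits, token names, and temperature value are made up, not taken from any real model.

```python
import math
import random

# Sketch of the probabilistic core: the model emits scores (logits) for
# candidate tokens and the next token is *sampled* from a distribution,
# not derived by a fixed rule. All numbers here are illustrative.
def sample_token(logits, temperature=1.0, seed=None):
    rnd = random.Random(seed)
    tokens = list(logits)
    weights = [math.exp(v / temperature) for v in logits.values()]
    return rnd.choices(tokens, weights=weights)[0]

logits = {"a": 2.0, "b": 1.5, "c": 0.1}
samples = [sample_token(logits, seed=s) for s in range(50)]
print(samples[:10])  # different seeds pick different tokens
```

The same input yields different outputs across samples, which is exactly the "no hard rules" property the comment describes.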

And, for what it's worth, "Real Programmers Don't Use Pascal" [1] was not written about assembler programmers; it was written about Fortran programmers, a new Priesthood.

[1] https://web.archive.org/web/20120206010243/http://www.ee.rye...

Thus, what I expect is for a new Priesthood to emerge: prompt-writing specialists. And this is what we actually see.


Fortran is about numerical programming without having to deal with explicit addresses. Symbolic programming is something like Lisp.

LLMs do not manipulate symbols according to rules; they predict tokens (arbitrary glyphs, in human parlance) based on statistical rules.


A Fortran compiler internally transforms expressions to assembly using symbolic computation: a+0 -> a, a+b -> b+a.

LLMs are very much supported by symbolic computation. These "arbitrary glyphs" are computed by symbolic computation. The output of LLMs is constrained by programming language grammar in code generation tasks. LLMs are often run with beam search, which is symbolic computation.
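The deterministic rewriting mentioned above (a+0 -> a) can be sketched as a toy rule applier. This is not actual compiler internals, just an illustration of rule-based symbolic computation: the rules always fire the same way, with no probability involved.

```python
# Expressions are nested tuples ("+", left, right) or leaves (names, numbers).
def simplify(expr):
    """Recursively apply the rewrite rules a+0 -> a and 0+a -> a."""
    if not isinstance(expr, tuple):
        return expr
    op, a, b = expr
    a, b = simplify(a), simplify(b)
    if op == "+" and b == 0:   # a + 0 -> a
        return a
    if op == "+" and a == 0:   # 0 + a -> a
        return b
    return (op, a, b)

print(simplify(("+", ("+", "x", 0), 0)))  # -> x
```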


You are right that "LLMs improving themselves" is impossible. In fact, LLMs make themselves worse; it is called knowledge collapse [1].

[1] https://arxiv.org/abs/2404.03502


That paper again.

LLMs have been trained on synthetic outputs for quite a while since then and they do get better.

Turns out there's more to it than that.


These diminishing returns are exponentially diminishing returns [1].

[1] https://arxiv.org/abs/2404.04125

" We consistently find that, far from exhibiting "zero-shot" generalization, multimodal models require exponentially more data to achieve linear improvements in downstream "zero-shot" performance, following a sample inefficient log-linear scaling trend. This trend persists even when controlling for sample-level similarity between pretraining and downstream datasets, and testing on purely synthetic data distributions. Furthermore, upon benchmarking models on long-tailed data sampled based on our analysis, we demonstrate that multimodal models across the board perform poorly."
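The quoted log-linear trend can be made concrete with made-up coefficients (purely illustrative, not fitted to the paper's data): if downstream performance grows linearly in log(samples), then every fixed gain in performance costs a multiplicative increase in data.

```python
import math

# Hypothetical log-linear scaling curve: perf = a + b * log10(samples).
# The coefficients are made up to illustrate the shape of the trend.
a, b = 10.0, 5.0

def perf(samples):
    return a + b * math.log10(samples)

# Linear improvements require exponentially more data:
print(perf(1e6) - perf(1e5))   # 10x more data -> +5 points
print(perf(1e8) - perf(1e6))   # 100x more data -> +10 points
```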


  > But we shouldn't pretend that you will be able to do that professionally for much longer.
Who are "we" and why do "we" "pretend"?

  > Will they ever carve furniture by hand for a business? Probably not.
This is definitely a stretch.

Catastrophic forgetting is overfitting.

No, it’s actually the math of overwriting. Imagine you hiked down into a valley (task A) and settled there. Then you decide to climb a new mountain to find a different valley (task B). You successfully move to the new valley, but in doing so you destroy the path back to the first one. You are now stuck in the new valley and have completely 'forgotten' how to get back to the first one.
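A minimal numerical caricature of that overwriting, with a single shared parameter (nothing like a real LLM, but it shows the mechanism): fitting the parameter to task B by gradient descent erases what it had learned for task A.

```python
import numpy as np

# One shared parameter w, fit sequentially to task A (y = 2x),
# then to task B (y = -x). Losses are mean squared error.
rng = np.random.default_rng(0)
x = rng.normal(size=100)

def loss(w, target_w):
    return float(np.mean((w * x - target_w * x) ** 2))

w = 0.0
for _ in range(200):                         # gradient descent on task A
    w -= 0.1 * np.mean(2 * (w * x - 2 * x) * x)
loss_a_before = loss(w, 2)                   # near zero: task A learned

for _ in range(200):                         # now descend on task B only
    w -= 0.1 * np.mean(2 * (w * x + x) * x)
loss_a_after = loss(w, 2)                    # large: task A overwritten

print(loss_a_before, loss_a_after)
```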

Not exactly; not at all, even, in terms of the way LLMs are trained.

In RL it can be that you are no longer getting meaningful data because you are 'too good' and don't get the "this is a bad answer" signal anymore, so you can't estimate the gradient.


  > the huge gains in coding performance in the past year have come from RL, not from new sources of training data.
This one was on HN recently: https://spectrum.ieee.org/ai-coding-degrades

The author attributes the past year's degradation of code generation by LLMs to excessive use of a new source of training data, namely users' code generation conversations.


Yeah, this is a bullshit article. There is no such degradation, and it’s absurd to say so on the basis of a single problem which the author describes as technically impossible. It is a very contrived under-specified prompt.

And their “explanation” blaming the training data is just a guess on their part, one that I suspect is wrong. There is no argument given that that’s the actual cause of the observed phenomenon. It’s a just-so story: something that sounds like it could explain it but there’s no evidence it actually does.

My evidence that RL is more relevant is that that's what every single researcher and frontier-lab employee I've heard speak about LLMs in the past year has said. I have never once heard any of them mention new sources of pretraining data, except maybe synthetic data they generate and verify themselves, which contradicts the author's story because it's not shitty code grabbed off the internet.


  > Yeah, this is a bullshit article. There is no such degradation, and it’s absurd to say so on the basis of a single problem which the author describes as technically impossible. It is a very contrived under-specified prompt.
I see a "No True Scotsman" argument above.

  > My evidence is that RL is more relevant is that that’s what every single researcher and frontier lab employee I’ve heard speak about LLMs in the past year has said.
Reinforcement learning reinforces what is already in the LM: it narrows the search over possible correct answers, while the wider search in non-RL-tuned base models results in more correct answers [1].

[1] https://openreview.net/forum?id=4OsgYD7em5
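The narrowing effect can be illustrated with toy numbers (not taken from the cited paper): suppose an RL-tuned model answers 60% of problems deterministically and the rest never, while the base model has a 30% chance per sample on every problem.

```python
# pass@k for the base model: probability that at least one of k
# independent samples is correct. All numbers are made up.
def base_pass_at_k(p, k):
    return 1 - (1 - p) ** k

rl_pass_at_k = 0.6   # deterministic answers: extra samples don't help

for k in (1, 10, 100):
    print(k, round(base_pass_at_k(0.3, k), 3), rl_pass_at_k)
```

At k=1 the RL-tuned model wins; with enough samples the wider base model overtakes it, which is roughly the pattern the linked review describes.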

  > I have never once heard any of them mention new sources of pretraining data, except maybe synthetic data they generate and verify themselves, which contradicts the author’s story because it’s not shitty code grabbed off the internet.
The sources of training data have already been grounds for allegations, even leading to lawsuits. So I would suspect that no engineer from any LLM company would disclose anything about their sources of training data besides the innocent-sounding "synthetic data verified by ourselves."

From the days I worked on blockchains, I am very skeptical about any company riding any hype. They face enormous competition and they will buy, borrow or steal their way to try not to go down even a little. So, until Anthropic opens up the way they train their model so that we can reproduce their results, I will suspect they leaked the test set into it and used users' code generation conversations as a new source of training data.


That is not what No True Scotsman is. I’m pointing out a bad argument with weak evidence.

  >>> It is a very contrived under-specified prompt.
No True Prompt can be so contrived and underspecified.

The article about degradation is a case study (a single prompt), the weakest kind of study in the hierarchy of evidence. Case studies are the basis for further, more rigorous studies. And the author took the time to test his assumptions and presented quite clear evidence that such degradation might be present and that we should investigate.


We have investigated. Millions of people are investigating all the time and finding that the coding capacity has improved dramatically over that time. A variety of very different benchmarks say the same. This one random guy’s stupid prompt says otherwise. Come on.

As far as I remember, the article stated that the author found the same problematic behavior for many prompts, issued by him and his colleagues. The "stupid prompt" in the article is for demonstration purposes.

But that’s not an argument, that’s just assertion, and it’s directly contradicted by all the more rigorous attempts to do the same thing through benchmarks (public and private).

We are talking about a compiler here, and the "performance" referred to above is the performance of the generated code.

When you are optimizing a program, you have a specific part of the code to improve. That part can be found with a profiler.

When you are optimizing compiler-generated code, you have many similar parts of code in many programs, and a not-so-specific part of the compiler that can be improved.


Yes, performance of the generated code. You have some benchmark using a handful of common programs going through common workflows, and you measure the performance of the generated code. As tweaks are made, you see how the different performance experiments affect the overall performance. Some strategies are always a win, but things like how you lay out different files and functions in memory have different trade-offs and are hard to know up front without doing actual real-world testing.

  > As tweaks are made...
  > ...how you layout different files and functions in memory have different trade offs and are hard to know up front without doing actual real world testing.
These are definitely not algorithmic optimizations like privatization [1].

https://en.wikipedia.org/wiki/Privatization_(computer_progra...

To correctly apply privatization one has to have correct dependence analysis. This analysis uses the results of many other analyses, for example, value range analysis, something like the Fourier-Motzkin algorithm, etc.
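As a sketch of the flavor of reasoning involved, here is one Fourier-Motzkin elimination step over constraints of the form a*x + b*y <= c. This is a toy version, not a production dependence analyzer; dependence analysis builds on this kind of machinery to decide whether two loop iterations can touch the same array element.

```python
# Eliminate x from a system of constraints a*x + b*y <= c,
# leaving constraints on y alone.
def eliminate_x(constraints):
    upper = [(b / a, c / a) for a, b, c in constraints if a > 0]  # x <= cu - bu*y
    lower = [(b / a, c / a) for a, b, c in constraints if a < 0]  # x >= cl - bl*y
    kept = [(b, c) for a, b, c in constraints if a == 0]
    for bl, cl in lower:         # every lower bound must not exceed
        for bu, cu in upper:     # every upper bound: cl - bl*y <= cu - bu*y
            kept.append((bu - bl, cu - cl))   # => (bu - bl)*y <= cu - cl
    return kept

# x <= 3 and x >= 1 eliminate to the trivially satisfiable 0*y <= 2:
print(eliminate_x([(1, 0, 3), (-1, 0, -1)]))
```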

So this agentically optimized compiler has a program where privatization is not applied; what tweaks should agents apply?


  > This test sorta definitely proves that AI is legit.
This is an "in distribution" test. There are a lot of C compilers out there, including ones with git history, implemented from scratch. "In distribution" tests do not test generalization.

The "out of distribution" test would be like "implement (self-bootstrapping, Linux kernel compatible) C compiler in J." J is different enough from C and I know of no such compiler.


> This is an "in distribution" test. There are a lot of C compilers out there, including ones with git history, implemented from scratch. "In distribution" tests do not test generalization.

It's still really, really impressive though.

Like, economics aside this is amazing progress. I remember GPT3 not being able to hold context for more than a paragraph, we've come a long way since then.

Hell, I remember bag of words being state of the art when I started my career. We have come a really, really, really long way since then.


  > It's still really, really impressive though.
Do we know how many attempts to create such a compiler were made during previous tests? Would Anthropic report on the failed attempts? Can this "really, really impressive" thing be the result of luck?

Much like quoting Quake code almost verbatim not so long ago.


> Do we know how many attempts were done to create such compiler before during previous tests? Would Anthropic report on the failed attempt? Can this "really, really impressive" thing be a result of a luck?

No, we don't, and yeah, we would expect them to only report positive results (this is both marketing and investigation). That being said, they provide all the code for people to review.

I do agree that an out-of-distribution test would be super helpful, but given that it will almost certainly fail (given what we know about LLMs), I'm not too pushed about it.

Look, I'm pretty sceptical about AI boosting, but this is a much better attempt than the Windsurf browser thing from a few months back, and it's interesting to know that one can get this to work.

I do note that the article doesn't talk much about all the harnesses needed to make this work, which, assuming this approach is plausible, is the kind of thing that will be needed to make demonstrations like this more useful.


  > No we don't and yeah we would expect them to only report positive results (this is both marketing and investigation).
This is a matter of methodology. If they train models on that task, or somehow score/select models on their progress on that task, then we have test set leakage [1].

[1] https://en.wikipedia.org/wiki/Leakage_(machine_learning)

This question is extremely important because test set leakage leads to impressive-looking results that do not generalize at all.
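A quick numeric illustration of why selection on the test set is a form of leakage (random data only, no real models): picking the best of many random "models" on the test set inflates the reported score even though every model is guessing.

```python
import numpy as np

rng = np.random.default_rng(0)
y_test = rng.integers(0, 2, size=100)       # 100 binary test labels

scores = []
for _ in range(1000):                        # 1000 random guessers
    preds = rng.integers(0, 2, size=100)
    scores.append(float(np.mean(preds == y_test)))

print(np.mean(scores))   # ~0.5: honest expected accuracy of a guesser
print(np.max(scores))    # well above 0.5: the "selected" model's score
```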


> This is matter of methodology. If they train models on that task or somewhat score/select models on their progress on that task, then we have test set leakage [1].

I am quite familiar with leakage, having been building statistical models for maybe 15+ years at this point.

However, that's not really relevant in this particular case, given that LLMs are trained on approximately the entire internet, so leakage is not really a concern (there is no test set, apart from the tasks they get asked to do in post-training).

I think it's impressive that this even works at all, even if it's just predicting tokens (which is basically what they're trained to do), as this is a pointer towards potentially more useful tasks (convert this COBOL code base to Java, for instance).

I think the missing bit here is that this only works for cases where there's a really large test set (the HTML spec, the Linux kernel). I'm not convinced that the models would be able to maintain coherence without this, so maybe that's what we need to figure out how to build to make this actually work.


  > I think the missing bit here is that this only works for cases where there's a really large test set (the HTML spec, the Linux kernel). I'm not convinced that the models would be able to maintain coherence without this, so maybe that's what we need to figure out how to build to make this actually work.
Take any language with a compiler and several thousand users and you have plenty of tests that approximate the spec inward and outward.

Here's, for example, the VHDL test suite for GHDL, an open source VHDL compiler and simulator: https://github.com/ghdl/ghdl/tree/master/testsuite

The GHDL test suite is, to my knowledge, sufficient and general enough to develop a pretty capable clone. Also to my knowledge, it is the only open source VHDL compiler, and it is written in Ada. And, again, the expertise to implement another one from scratch to train an LLM on is very, very scarce: VHDL, being a highly parallel variant of Ada, is quirky as hell.

So someone could test your hypothesis on VHDL: agent-code a VHDL compiler and simulator in Rust so that it passes the GHDL test suite. Would it take two weeks and $20,000, as with C? I don't know, but I really doubt it.


There are two compilers that can handle the Linux kernel. GCC and LLVM. Both are written in C, not Rust. It's "in distribution" only if you really stretch the meaning of the term. A generic C compiler isn't going to be anywhere near the level of rigour of this one.

There is tinycc, which makes it three compilers.

There is a C compiler implemented in Rust from scratch: https://github.com/PhilippRados/wrecc/commits/master/?after=... (the very beginning of commit history)

There are several C compilers written in Rust from scratch of comparable quality.

We do not know whether Anthropic has a closed source C compiler written in Rust in their training data. We also do not know whether Anthropic validated their models on their ability to implement a C compiler from scratch before releasing this experiment.

The language J I proposed does not have any C compiler implemented in it at all. Idiomatic J expertise is scarce and expensive, so it would be a significant expense for Anthropic to have a C compiler in J in their training data. Being Turing-complete, J can express all the typical compiler tips and tricks from compiler books, albeit in an unusual way.


TinyCC can't compile a modern Linux kernel. It doesn't support a ton of the extensions the kernel uses. That Rust compiler similarly can't do it.
