AI and the Problem of Knowledge Collapse (arxiv.org)
189 points by kmdupree 11 months ago | 127 comments



I don’t disagree much with the paper’s premise, except that I am turned off whenever there is a handy new tool for us to use and then some people come up with reasons why using the new tool is a harmful thing to do.

We humans evolved to be very flexible and naturally use what is in front of us to solve whatever problems we need to solve. For some tasks LLMs are very useful and for other tasks they don’t offer much help. I feel sorry for people who don’t learn how to use common tools because they are stuck in place or have unreasonable fears.

It is reasonable to worry about how powerful actors might use AI for surveillance and removing privacy rights, but I strongly disagree that it is a bad thing for individuals to sometimes save time and sometimes get better results by manually using tools like Perplexity and ChatGPT, and by automating things using LLMs as sources of general common-sense knowledge, as language processors, for data transformation, and so on.


The reliance on the tool in front of you can constrain you.

Some tools are fads, some make you less of a thinking person.

When I faced a second wave of IDEs, I figured that any IDE is a fad - there will be a better IDE in the near future, and your boss will require you to use it.

The use of an IDE can make you know less about the project at hand, because you can always investigate, as opposed to internalizing/remembering. The difference? If you know the internal relations, you can judge changes more easily. You are a better programmer that way, and I would say a better person.

The same goes for LMs - they help you avoid knowing/remembering and creating thoughtfully. LMs constrain you to not knowing fully.


> The use of an IDE can make you know less about the project at hand

I disagree.

Someone who's not interested in the project might know less (but is still able to accomplish their task), whereas without such an IDE they will have to learn a lot first. It's one reason why beginner Java courses at university ask you not to use an IDE (at least for the first few tutorials covering the basics).

But this isn't the fault of the IDE at all; it simply reflects the motivations of the person using it.

An IDE makes someone motivated to learn more capable. It's a multiplier/enabler.

> less of a thinking person.

No, it makes you think less about unimportant stuff, and thus have more to dedicate to the important stuff. Effortless navigation in an IDE (like IntelliJ) means you can move through the call graph of an application without having to see/understand the entire application completely.

The same can be said for LLMs - maybe not today's, since it's still early, but at some point these LLM tools will make trivial/unimportant things easy, leaving time for you to focus on the things the LLM can't do.


And the written word helps you to not remember anything, to paraphrase Socrates, so better get to memorizing the Odyssey so you can be an even better person.


I have better things to burn my neurons on than memorization. I normally forget almost everything and only remember abstract stuff. Maybe it's because I have used web search for 25 years that I learned to remember good search keywords rather than the actual information. Is this like relying on maps for navigation instead of pure memory?


London taxi drivers, who need to remember approximately 35 thousand streets and their more or less current state, show two changes in their brains: they have a dedicated brain area that helps with navigation, and their hippocampus is bigger and denser than the hippocampus of regular people of their age and comparable lifestyle.

The hippocampus is the brain's system for transferring knowledge from short-term memory to long-term memory.

If you do not have enough physical activity, it shrinks with age.

In other words, London taxi drivers are slightly younger brain-wise than other people of their age.

Keeping your memory fresh and alive is good for your health.


Can you give more details regarding what IDE features you have in mind? I agree that Java/C# people not knowing how to use the command line is bad (because they only know how to click "build project", etc.), but IDE features such as LSP, compiler error messages as you type, intelligent completion, variable renaming, etc. are, I think, essential tools.


I think LSP is the Language Server Protocol. I do not use that. I can understand its utility, but I do not use it.

"Compiler error messages as I type" distract me from, well, typing. I can pause to think, and then I get an error message for my still-incomplete code. I cannot help but notice it, and that interferes with my train of thought.

I have had trouble with my keyboard lately (two keys do not produce reliable key press events), and only then did I start to use Ctrl-N in vim to autocomplete.

The need for variable renaming is often related to variables in most mainstream languages having multiple roles due to side effects: on input an array has one meaning, and after execution of some function the same array has a different meaning. I prefer not to do that, and greatly prefer languages with controlled side effects.


OK. For reference, I was an Emacs user for 20 years, and participated heavily in emacs extension development. I used to think like you early on. But, especially if you use languages with expressive/complex type systems, having the editor be able to inform you about types is very useful. So I understand where you're coming from, but you're probably being too absolutist and are upholding preferences that you should re-examine. For example, it's basically silly to write Rust without using rust-analyzer. I agree that feedback can be a distraction while typing but it should be possible to tune your IDE editor to not display anything until a specified amount of time has passed with no keypresses.


I do not write Rust. My day-to-day job is supporting a C++ application and, frankly, what Rust offers does not help there.

When I program for fun, I prefer to use Haskell, into which "Rust" can be embedded if I need that. But the problems I am interested in looking deeply into do not benefit from the borrow checker's checks. Sparse non-negative matrix language modeling is not a thing Rust can help with. ;)

(continuous problems like ML generally do not benefit from a borrow checker)

And I do not use Emacs. I use vim, as the EDITOR for Midnight Commander. I go to ghci when I need feedback, not earlier. ;)


> You are a better programmer that way, and I would say a better person

You lost me here. A better person by using old-school IDEs? What are you smoking?

A better person uses an LLM and whatever tool he has at his disposal to get the job done ASAP, and then goes outside to enjoy life with fellow humans.


What you do with an LLM will affect other people, especially if the result is code. The cumulative time others spend reading your code will be much longer than the time you spent writing it.

You go hang out with your five to ten friends, and meanwhile dozens of other men and women will read your code.

LLMs do not help you produce smaller code; I see no such goal being pursued.


That's why I make an effort to report anyone whose code I have to read to the police. They are immoral people. All codebases should be a single character.


> You are a better programmer that way, and I would say a better person.

You think not using IDEs makes you a better person?


IDEs add overhead, thus burn CPU cycles, thus increase your CO2 footprint. Not using IDEs is the only moral choice.


So true. The only moral dev tool is a HEX editor.

Or, for those less capable, also add a C compiler, as any newer language is guaranteed to have either runtime overhead or wasteful compile times.


If an IDE makes you more productive, you can spend the extra time optimizing your code to reduce CO2, so it depends.


The only moral choice is to turn the computer off.


"Real programmers use COPY CON PROGRAM.EXE"


What is a fad to you?

To me a fad is a topic of general interest, time-boxed to a short period, and something frivolous: pet rocks, hula hoops, sneakers with wheels.

I think we have different definitions.


>I am turned off whenever there is a handy new tool for us to use and then some people come up with reasons why using the new tool is a harmful thing to do.

Same, but it turns out it's complicated. There is something materially different between learning French and using your phone to translate French. Or learning to play the piano vs. listening to a recording of someone playing the piano. The danger is that these tools make you feel that you can do something when you can't.

> For some tasks LLMs are very useful and for other tasks they don’t offer much help

Yes, and the difference is that I don't care about bash so I let it write scripts for me, but I DO care about other languages so I generally write my own. I'm basically curating my own skillset according to value and capacity. In that way AI can be a big win for teams of approx 1 who only need pro forma help in certain areas, but deep expertise in others.


> The danger is that these tools make you feel that you can do something when you can't.

I don't think people are so stupid as to feel this way. It didn't happen with cars - you can travel hundreds of miles in a few hours with one, whereas without one you can barely do five. And yet people have never made the mistake of thinking that they could travel that fast themselves.

LLMs as a tool are not more harmful than what could already be done with existing tools prior to the advent of LLMs. And the benefits of LLMs should outweigh the harm, imho.


> I don't think people are so stupid as to feel this way. It didn't happen with cars - you can travel hundreds of miles in a few hours with one, whereas without one you can barely do five. And yet people have never made the mistake of thinking that they could travel that fast themselves.

But cars are in a totally different category insofar as they do one very specific thing and are bound by very obvious physical limits. Our own physical limitations are very obvious to us; our intellectual limitations, I think, are less so. Further, companies (I think knowingly) market AI (read: LLM chatbots) as though it's some general, infallible solution to all intellectual and digital content problems. This is obviously not the case, and if you use a chatbot to work in a domain in which you have any expertise whatsoever, you'll quickly uncover this. The problem is people without expertise using these tools and being fooled into thinking they've plumbed the depths when they haven't waded past the shallows. Physically speaking, your capabilities are very apparent when you attempt to run 100 mph and simply *cannot*. Realizing that you cannot do something or don't understand something correctly in the intellectual realm is a far different scenario.


>but I strongly disagree that it is a bad thing for individuals to sometimes save time and sometimes get better results

you shouldn't really use tools to save time or 'sometimes' get better results, because that's an indication of a sort of laziness. When you use powerful tools you need to understand exactly in what way they save you time or effort, what their failure modes are, otherwise they become pathways for exploitation.

The recent backdoor is a good example of this. People pile automation and tool chains in software on top of each other but don't bother to actually verify what comes through them, and so rather than being leverage for them, they become leverage for someone else. If you use LLMs to process information faster but someone feeds you garbage, you're now the fastest consumer of garbage. There's nothing worse than a gullible user with a powerful tool.

And given that the general attitude nowadays is that tools are the sort of thing that allows you to turn your brain off, rather than understanding that you need to be smarter, more critical, and more alert the more tools you use, it's very reasonable to be extremely conservative about adopting new tools.


I'm not sure exactly what you're disagreeing with. There are plenty of examples out there of individuals using these tools badly. E.g., the teacher who used ChatGPT to generate student reviews. The students who use them as high-quality plagiarism engines. Individual marketers using the saved time to spam many more people. People who want internet points generating forum comments and Stack Overflow answers. Random kooks generating disinformation, propaganda, and hate. Et cetera, ad nauseam.

Given that there are plenty of examples of people causing harm, I'm not sure how you can disagree with the notion that people can in fact use these tools harmfully. If a tool solves my problem by creating a problem for you, why shouldn't people point that out?


> using these tools badly.

I assume you mean unethical usage rather than just bad.

You believe it causes harm, but who gets to decide what is ethical? I certainly do not delegate my judgement to you - only I can judge what I do myself.

Therefore, the only metric to judge is through the lens of legality (because everyone "agrees" on what is legal).


> Therefore, the only metric to judge is through the lens of legality (because everyone "agrees" on what is legal).

I just want to make sure we're all on the same page here: You believe that prior to Lawrence v. Texas, being a gay man in many of the US states was unethical? That prior to 1865 chattel slavery was ethical?

I often feel like HN has devolved into high-school level science fiction and philosophy, but this is rather on the nose.


> That prior to 1865 chattel slavery was ethical?

Some people in that era did believe it was ethical, and some didn't. After a while, more and more people were convinced that it was unethical (and, conveniently, machinery made slavery less economical for some industries). Once enough people changed their minds about whether it was ethical, there was a large argument that resulted in laws rather than individual ethical beliefs.


Legal is what we collectively decide. Ethical is what we personally decide.

Clearly something is either legal or not, and we have mechanisms set up to clarify where there is dispute.

Some professions have ethical standards. Some even have enforcement of them, making them pseudo-legalities.

But personal ethics, are, well, personal. You may find it ethical to eat meat, I might not. You may consider it unethical to eat cheese, or coffee, or honey. You might consider it ethical to pay minimum wage, or require employees to answer emails after hours. Or rent out an apartment, or wear white in October.

Ethics are personal and vary enormously. Declaring things to be ethical or not is therefore somewhat unhelpful. Clearly some will agree, some will not.

Whole college courses exist to try and define ethics and to encourage students to at least think about what their own ethical boundaries might be.


> Legal is what we collectively decide.

It's not even that. The collective didn't ask for sodomy laws. The collective didn't repeal Roe v Wade.


Enough did that it passed, because the USA's system of first-past-the-post voting has such a loophole.


Sure, if you define the "collective of American voters" as "five unelected people", then "enough did".


This is both incredibly ahistorical and clumsily sidesteps the fact that, were you living in that time, according to your statement, you would have found it ethical.


> you would have found it ethical.

Not having lived in that era, it cannot be said which side I would have been on - it would have depended on which side I was born into and the circumstances I found myself in. The truth is that whichever "ethical" belief benefited me the most would have been the belief I followed.

Most people believe that their beliefs are ethical, but in reality they are just post hoc rationalizations of decisions that benefit themselves (knowingly or not). I am just being clear rather than trying to hide this fact.


> the only metric to judge [what is ethical] is through the lens of legality

You've told us exactly what "side you would have been on".


Yeah, HN has a strong vein of Dunning-Kruger running through it. Anybody taken in by the waffle about the law being the only metric to judge actions has never thought about where laws come from. Or read MLK's Letter from a Birmingham Jail, which clearly and passionately discusses the topic.


Yeah, I mean, what happened with the invention of the calculator? Sure, maybe basic math skills declined… but I think I grasp relativity better than my father does or ever will…


> I am turned off whenever there is a handy new tool for us to use and then some people come up with reasons why using the new tool is a harmful thing to do.

So we shouldn't investigate (theoretically and practically) the potentially harmful effects of a new tool? Should we treat with coldness (dare I say disgust) anyone who dares?

You know, there is a big hole in the ozone layer that is now closing (or at least not expanding over the whole planet) because some poor researchers went against the grain and investigated all the effects of this new invention called CFCs.


>We humans evolved to be very flexible and naturally use what is in front of us to solve whatever problems we need to solve.

Ha, yeah, right. Almost all of us spend the vast majority of our time and energy just making dishonest rich people richer. In a world so full of exploitation, it should be no surprise that so many people are distrustful.


Is it the AI that's the trouble or the hostile new information environment we're expected to navigate and survive? Expecting us to remain sane amidst these torrents of information without new tools for querying and filtering it is cruel.


I only skimmed the paper, but judging by the abstract/intro and the definitions at the end, it's not concerned with "hostile" anything, but rather, with AI mediating access to knowledge.

Specifically, they mention that "knowledge collapse" is the narrowing of the "human working knowledge" and the universe of things "worth knowing" or "epistemic limit", which may result in obscuring/hiding what they call "the long tail of knowledge" (they define all these terms). They explain this better than me, but I think it's all about constraining the human relation to knowledge by these new "recursive AI" processes, without any malice involved.

I don't know if the paper mentions this, but I've thought about this (in less smart terms). It's probably not that AI is a radically new process in this sense but, like a lot of tech, it makes it faster.

We often say "well, things were like this with [written language|printing press|the internet]" but isn't it possible that a relatively harmless process could be accelerated to a harmful level by technology, even if it's not "fundamentally" something new?

Can we always fall back to the belief that "people from previous generations complained of $THING and it turned out ok, therefore $NEW_THING is probably OK too"?


Anyone who needs to know more than the narrow middle of knowledge will seek it out.

The real issue, as I see it, is that people who have that kind of knowledge need to share it and have it be seen by those who are interested in it.

There’s a huge problem with search engines serving junk ‘narrow middle’ SEO-optimized content to the endless horde, as opposed to real, quality knowledge. This issue predated generative AI.


> Anyone who needs to know more than the narrow middle of knowledge will seek it out.

This is what happens now, but do we know it will remain so? If AI is carelessly integrated into the teaching loop, I suppose it may narrow the knowledge even there. And then, how will people know there's more to be known?

And even outside formal teaching, how will regular people know there's more to be known beyond what AI tells them?

How will you even know you "need to know more"? Who will tell you?

Already we see a lot of what people "know" (about anything: biology, WW2, ancient history, politics, art) is mostly informed by Hollywood and internet-spread "pop knowledge" (directly or indirectly, as in watching a documentary or reading a newspaper article that were themselves informed by regurgitated "pop knowledge").

(In case anybody is wondering, I don't consider myself outside of this. I know I'm impacted!)

So this phenomenon isn't new... but will AI make it worse?


The problem is that people assume others are too stupid for technical information, so it's siloed. People in the field have learned to coax a search engine into returning technical results (e.g. looking for direct links to documentation or using a search tool like Google Scholar to read the scientific literature). However, most people aren't taught these skills in a general sense. All they might know is that narrow scope written at a 6th grade reading level. On the other hand, if we were able to give more people ready access to the ground-truth technical information on different topics, instead of narrow-scope information that often leaves you misinformed, I'd imagine that would kick-start an untold era of innovation and progress for our species. Too bad using our computer networks to distribute advertising and propaganda is seen as more profitable than future advancements, which I guess would be liable to disrupt the profitable status quo.


The baseline will always shift to accommodate the conditions; that's how the hedonic treadmill evolved. The perspective we have, stuck inside the system, doesn't make the people of Denmark and the people of Burundi equally happy.


> but rather, with AI mediating access to knowledge.

Rephrasing McLuhan "The medium is the m[ea]ssage" we can say that "The AI is the medium and the m[ea]ssage".


The information age is nothing new (for us), and neither is media culture generally, nor even Silicon Valley. AI will now be some new part of the complex.

It seems obvious to me that a significant step to combat this effect would be not calling the thing "artificial intelligence". I like the term GPT because it is descriptive and accurate. GPT doesn't suggest in any sense that we can rely on the thing or that we are comparable to it, in the same way "cell phone" doesn't. Imagine if we really bought into the idea that phones are truth receivers, or called them our closest friend. But that seems to be the idea with AI.

Keep speaking up too. Just calling bullshit on the thing implicitly gives people permission to turn it off. The tech will not cause knowledge collapse among a population that doesn't need it. I do use it, by the way, e.g. GitHub Copilot, which is nice.


> The tech will not cause knowledge collapse among a population that doesn't need it.

Every Google search now returns loads of AI-generated websites full of low-quality information. You could call the things "bullshit generators", and the people who really want to make websites full of bullshit will still use them to flood the internet with that bullshit. We all suffer the degradation of the internet when that happens.


I'm with you. I started moving away from Google a few years ago. That's what I mean I guess -- when something becomes so full of it, what can you do but turn it off? It's Google's own conceit that without it we'd be in the dark, Facebook's that without them we are lonely and alienated, missing out, etc. In actuality they are entirely dependent on you and are terrified you'll look away. They are enormous media companies and nothing more.

Again, it's good that you call bullshit on it. Things are not as set in stone as they appear -- the tech is finicky and expensive, and we still have the wealth of "legacy knowledge" online and offline. There is enormous investment being made in this sector currently. Remember the huge amount of bullshit that was spread all over during the height of the cryptocurrency craze, uberization, social media, war on terror, etc. As I said, "AI" is a new part of the complex. Turning these things off is a way to stay sane, no more or less.


> Every Google search now returns loads of AI-generated websites

That is a Google problem, not an AI problem.


It's society's problem, enabled by changes in technology around AI and LLMs. We have the potential to build the world's greatest repository of knowledge, and what little progress we had made has been eroded in recent years by AI technologies. It seems reasonable that chatbot-style LLMs can help, but they currently do this by replacing a sea of web pages with a single corporate interface.

I think it's worth asking questions about how our data is being used to fuel corporate interfaces to that data, who owns the rights to that data and those systems, and how we might maintain independent creators on such a system. Will iFixit still have the resources and means to take high-quality documentary photos and write-ups for new devices if everyone consumes that information via an Apple, Google, or OpenAI interface? How will you post a structured, expandable, localizable repository of your own data to share, i.e. how will you publish web pages and how will people find them?

I believe society will solve these problems via the actions of many individuals and organizations, but with certain organizations holding so much power, it's important to consider and inquire into how that power is being used.


It can be (and is) both at the same time.


Maybe the problem (which seems easily fixable) is more "rizz collapse", aka blandness, than this "knowledge collapse".

The model hasn't forgotten the diversity of material it was trained on, but outside of a context predicting a "long tail" response, it's going to predict a mid response. You can always prompt it to respond differently though.

Blandness is more of an issue since that's what most-probable word-by-word generation is going to give you, rather than the less predictable, but more interesting, responses that an individual might give. Prompting could help by asking the model to reply in the idiosyncratic style of some celebrity, but this is likely to come across as a cheesy impression. Maybe the models could be trained to generate conditioned on a provided style sample, which could be long enough to avoid the cheesiness.
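To make the "most-probable generation is bland" point concrete, here is a minimal toy sketch of greedy vs. temperature sampling (the vocabulary and logits are made up for illustration; this is not any particular model's API):

  import numpy as np

  def sample_next_token(logits, temperature=1.0, seed=None):
      # Zero temperature collapses to the most probable ("mid") token;
      # higher temperature gives the long tail a chance. Seeding makes the
      # randomness reproducible.
      rng = np.random.default_rng(seed)
      if temperature == 0:
          return int(np.argmax(logits))  # greedy decoding: always the center of the distribution
      scaled = np.asarray(logits, dtype=float) / temperature
      probs = np.exp(scaled - scaled.max())
      probs /= probs.sum()
      return int(rng.choice(len(probs), p=probs))

  # Hypothetical logits over a tiny three-token vocabulary; index 0 is the "mid" continuation.
  logits = [3.0, 1.0, 0.5]
  print(sample_next_token(logits, temperature=0))             # always 0
  print(sample_next_token(logits, temperature=1.5, seed=42))  # sometimes picks the tail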


> You can always prompt it to respond differently though.

No, you cannot.

I rarely do anything with LMs [1], but when I do, I ask them to code "blocked clause decomposition" [2] in Haskell. Blocked clause decomposition is very, very simple as far as implementation complexity goes, and Haskell is just an unusual but simple implementation language.

  [1] https://arxiv.org/pdf/2403.07183.pdf
  [2] https://www.cs.utexas.edu/~marijn/publications/bcd.pdf
I tried that on one lesser-known model (my son sent me a link) and on Phind, twice.

Phind's responses were very illuminating.

On the first try, it tried to make me code it myself by pointing to a paper that does not contain any blocked clause decomposition algorithms, just worthless combinations of them and references to other papers. Of course, I had read that paper before. When I specifically asked it to provide me with code, it did not even provide the correct type for a blocked clause, which should be a pair of a CNF clause and a literal that blocks it.

On the second try I could not even get past the bullshitting phase - I simply was not able to persuade the LM to write any code. I attempted the second try after I saw an advert here on HN about Phind's new and updated LM, supposedly better at coding.

Try that yourself with an unusual but well-known problem of your choice.

And this is exactly the problem of knowledge collapse: you can't ask an LM to code a solution to a problem it hasn't seen several dozen times during training. You just cannot; you will not get a satisfying answer.


I believe if you pay for Phind Pro it'll give you access to GPT-4 and Claude Opus, which are the two most capable models.

If you are just using the free version of Phind, then it uses some MUCH smaller and less capable model of their own. Even if you use Phind Pro, or Phind Plus, it says their own most capable model is Phind 70B (tiny compared to GPT-4 class models).

You can't just use Phind and take that as indicative of what state-of-the-art models can do, without knowing what model you are actually using. You also can't expect any of these models to do as well with Haskell as they do with Python or Java which they will have seen massively more of during training.


> You also can't expect any of these models to do as well with Haskell as they do with Python or Java which they will have seen massively more of during training.

This is exactly the person’s point.

Did you try this Haskell problem with a paid version, and can you provide an example that actually disputes the described experience? Why would someone pay for the premium version of a product when they don't perceive the free version as working?


I uploaded the paper to ChatGPT and it summarized it, wrote pseudocode, and wrote Haskell code with very little prompting from me... I have no idea if it's _good_ code, because I didn't run it and I don't know the domain or Haskell, but it did produce code -- and it explained that some of it is non-trivial to implement and might need more attention, and explained why.


But this is _exactly_ the problem being discussed.

Domain expert: “I cannot get it to produce something that even approximately resembles x”

Person: “I gave it the paper and I got maybe x?”

There is a fundamental knowledge gap here; just because you can get it to do something, that isn't the goal, or even a good outcome. The illusion that it's capable of doing something it evidently cannot is a dangerous one, because people who don't know better are being convinced that they understand or comprehend when they don't.


Okay so you have no idea how good the output is.


> And this is exactly the problem of knowledge collapse: you can't ask an LM to code a solution to a problem it hasn't seen several dozen times during training. You just cannot; you will not get a satisfying answer.

I think that's more a function of the coding ability of the model you are using, and obviously they are going to do much better on languages better represented in the training set.

What the paper is referring to as "knowledge collapse" isn't inability to do something, but rather getting a mid response since by definition that's going to dominate any normal (pardon the pun!) distribution.

Here's what he said:

"While large language models are trained on vast amounts of diverse data, they naturally generate output towards the 'center' of the distribution."

But, you can generally prompt around this sort of thing by asking the model to adopt a preferred context such as "you're an expert xxx", providing of course that it did see such expert-tagged material in the training set.


> obviously they are going to do much better on languages better represented in the training set.

Which, again, indicates that the current crop of LLMs are just fuzzy-recollection/fuzzy-interpolation machines, and have none of the actual thought/reasoning capabilities some people are claiming they have.


If that were the case, then you wouldn't get the transfer learning from models being trained on programming languages, to reasoning applied in a non-programming context, which has been universally observed for all these models. They are learning abstract "thought patterns", not just interpolation.

Learning a programming language (or any complex skill) just by reading a book and looking at examples, without actually practicing it yourself, is basically impossible, even for ourselves, let alone an LLM. It is amazing how well the LLMs can do at programming in cases where they were trained on masses of well-commented examples, but without also being able to learn "on the job" themselves, you need to temper expectations.


If that were the case, it would be impossible to describe rules to them and have them follow those rules.


I think Chomsky's idea that LLM's are a form of automated plagiarism is the best description if we think of plagiarism in a neutral sense and not a highly negative sense.

If we think in terms of automated plagiarism, no one can complain that the model doesn't produce original ideas.

Automated plagiarism could have many uses, but it is obviously stupid to expect an original idea from plagiarism. We wouldn't complain that the model can't produce things it hasn't seen before if we think in terms of plagiarism.


Chomsky has been wrong for decades. He doesn't want to believe that LLMs are really learning language because that conflicts with his assertion that it takes something much more specialized to do so.

Basically, Chomsky has been proved wrong by LLMs, but rather than believing what is in front of his eyes he is saying "that can't be right, because it'd mean I am wrong".


“Chomsky is wrong because I say he is, and because he doesn’t like the LLM I like”

Mmmhhmmm, yes, very convincing argument there. I'd be willing to bet that Chomsky's arguments don't rely on such spurious reasoning. Automated plagiarism reads like a fantastic description of current LLMs to me.


Do you disagree that LLMs appear to have learnt language? Whether they plagiarize or not is irrelevant as long as they are not quoting verbatim. If they are coming up with their own sequence of words, making sense, and remaining grammatical, then how can you argue that they have not learnt language?

Since they do at least (whatever other shortcomings they may have) appear to have learnt language, just by "listening to" a bunch of examples, and since they are based on a transformer, not some specialized Chomskyan "language organ", the thesis that a "language organ" is needed is disproved.


Even assuming LLMs have learnt language (whatever that means), it is completely irrelevant to what Chomsky is studying, which is how the human language ability works. As an analogy, training a neural network to successfully predict the weather doesn't falsify physics-based models of the weather.


Sure, but for our cortex not to be able to learn language without becoming specialized for it, while a transformer can, implies that a transformer is a more powerful prediction engine than our cortex, which seems unlikely.

If you look at the cortex as a whole, what seems most striking is how uniform it is - the same basic architecture of six layers of neurons with a specific pattern of inter-layer connectivity. It seems nature has come up with a universal prediction architecture. Maybe Wernicke's area is fine-tuned for language, but to characterize it as a specialized "language organ" seems a bit of a stretch. Let's note too that language is only a million years old, while the cortex itself is hundreds of millions of years old, yet it has this mostly uniform architecture that evidently works just as well for vision as for hearing, etc.

So, sure, the ability of the ridiculously simple transformer architecture to learn language (and many other prediction tasks you throw at it), doesn't PROVE that the brain didn't have to evolve a highly specialized way to do it, but it certainly seems highly suggestive of it.

Since we now have an existence proof that a very simple architecture, not specialized for language, can learn language, it seems the onus is now on Chomsky to put some meat on the bones of his claim (without evidence) that the general cortical architecture is incapable of this without a high degree of specialization.


> Sure, but for our cortex not to be able to learn language without becoming specialized for it, while a transformer can, implies that a transformer is a more powerful prediction engine than our cortex, which seems unlikely.

On the contrary, it's extremely likely that we can engineer something better than random evolution. And we have: calculators are better at adding and subtracting than humans are, bikes are faster than running, etc.

> If you look at the cortex as a whole, what seems most striking is how uniform it is - the same basic architecture of six layers of neurons with a specific pattern of inter-layer connectivity. It seems nature has come up with a universal prediction architecture. Maybe Wernicke's area is fine-tuned for language, but to characterize it as a specialized "language organ" seems a bit of a stretch. Let's note too that language is only a million years old, while the cortex itself is hundreds of millions of years old, yet it has this mostly uniform architecture that evidently works just as well for vision as for hearing, etc.

Language is ~100,000 years old, not a million. The brain is in fact not uniform: if you damage Broca's area you lose language capability. And to say that all cognitive function is the same general algorithm ignores the obvious fact that the brain performs different functions, and it doesn't answer the question of how and why this is so. There is a lot of cognitive behaviour that is not understood; if you want to explain those behaviours you implicitly have to distinguish them from one another.

> So, sure, the ability of the ridiculously simple transformer architecture to learn language (and many other prediction tasks you throw at it), doesn't PROVE that the brain didn't have to evolve a highly specialized way to do it, but it certainly seems highly suggestive of it.

It doesn't. Moro showed in a series of experiments that humans have difficulty learning non-hierarchical languages and use different parts of the brain to do so, which is highly suggestive that language is specialized.

> Since we now have an existence proof that a very simple architecture, not specialized for language, can learn language, it seems the onus is now on Chomsky to put some meat on the bones of his claim (without evidence) that the general cortical architecture is incapable of this without a high degree of specialization.

He has provided evidence and arguments, some of which I have pointed to above. Maybe you should actually read or listen to him.


> Moro showed in a series of experiments that humans have difficulty learning non-hierarchical languages and use different parts of the brain to do so, which is highly suggestive that language is specialized.

The cortex is built for hierarchical processing, because that's what's needed to model the world we live in. Physical objects are localized and have hierarchical detail, and larger visual scenes are the same. The kind of sequential (temporal) patterns relevant to us are also hierarchical, whether in visual, auditory or other domains.

The type of connectionist architecture needed to recognize hierarchical patterns is a layered one, where the receptive field and hierarchical level of abstraction grow as you ascend the layers. In our brain those layers come from different patches of our cortical sheet being connected. This is the reason that neural network architectures like CNNs and transformers also work to recognize hierarchical patterns in visual and temporal domains - because they both also use these layered architectures, which is all that is needed.

The reason why the function of damaged brain areas can't always be taken over by other areas is largely due to plasticity. Our brain peaks in its ability to form new synapses in the first few years of life. If you haven't learned language by the age of 3, then you will never be able to learn more than a crude type of pidgin language, despite all your "language areas" being intact. The same would be true of a different part of your brain trying to learn language as an adult - the plasticity is no longer there.


> But, you can generally prompt around this sort of thing by asking the model to adopt a preferred context such as "you're an expert xxx", providing of course that it did see such expert-tagged material in the training set.

So, I need the LM to be an expert in SAT solving and an expert Haskell programmer. This means I need the LM to be trained on SAT solving and on a vast amount of Haskell code.

I think it is already easy to see that the LM would have to be trained on an intersection of quite fringe (yes, I am watching that TV series right now) things.

I also think that it is beyond reason to ask for this.


Please get back to us when you try whatever you're trying to do with a good model (GPT-4, Opus, or Ultra).

I recommend refraining from making (and especially from posting) any conclusions until then.


> a good model

Strongly reminds me of "no true Scotsman."

If you have access to a good model, try that simple problem of blocked clause decomposition and please get back to us.


I do have access to good models, but you are the one making far-reaching statements about LLMs.

Please preface your future posts on the topic with “I’ve never tried any decent models, so I don’t know what I’m talking about”.


I will include such a disclaimer after you prove me wrong, as you have the means to do that.


Sounds good. What prompt do you want me to try?


"Please implement blocked clause decomposition in Haskell"

I actually asked my friend to ask whatever LLM he pays for to perform this experiment. I think he uses the latest or the previous GPT. At least he is paying for it, which I cannot do and won't do in the foreseeable future.

Just like I said above, not only did the LLM fail to properly implement the algorithm, it failed to provide even remotely correct types for clauses and blocked clauses.

What happened next amused me like nothing else related to AI.

In the exchange that followed the incorrect initial attempt, GPT clearly tried to persuade my friend to implement the correct version for it, constantly providing him with partially (and stupidly) incorrect code.

This is exactly what MathCraft [1] does.

[1] https://en.wikipedia.org/wiki/Cyc#MathCraft

And this behavior is clearly counterproductive. This is not what one might expect from anything that is advertised as a good assistant. I can use ghci, vim, and actual paper to much greater effect.


The LLM under test was GPT-4 Turbo.


>This means I need the LM to be trained on SAT solving and on a vast amount of Haskell code.

This is essentially why the current batch of top-tier LLMs are trained on a pile of text amounting to "the internet", containing every Haskell Q&A on Stack Overflow, every SAT implementation on GitHub, and every explanation of those terms on Wikipedia. It might seem unreasonable without a sense of the scale and expense of building these things; but yes, they have those input data in their training sets.


Blocked clause decomposition is implemented in, I think, two SAT preprocessors, both in C++. There are about three to five different articles on the topic.

For an LLM to implement BCD in Haskell, it needs to be able to translate between C++ and Haskell, and that is well beyond trivial.


Or introduce more noise or seeding to get more interesting responses. The `temperature` settings don't really satisfy this right now. I would like some determinism - but seeded randomly - so I can get similar responses if I like what is produced. Likewise, some kind of metadata or explicability that allowed us to take a known style or feature space of the model, perhaps from hand prompting, and then reuse it with some degree of modification, and maybe even combine it with others', would be very helpful. The work on adding model weights from fine-tuning seems directionally like what I'm talking about, though that isn't the form I'd want to expose to users.


That's interesting.

In keyword-based indexing solutions, a document vector is created using "term frequency-inverse document frequency" (TF-IDF) scores. The idea is to pump up the document on the dimensions where it is unique compared to the other documents in the corpus. So when a query is issued with emphasis on a certain dimension, only documents that have higher scores in that dimension are returned.

But the uniqueness in those solutions is based on keywords being used in the document, not concepts.

What we need here to eliminate "blandness" is conceptual uniqueness. Maybe TF-IDF is still relevant to get there. Something to think about.
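For what it's worth, here is a minimal TF-IDF sketch on a made-up toy corpus (standard tf * log(N/df) weighting; nothing model-specific):

  import math
  from collections import Counter

  # Toy corpus, made up for illustration.
  docs = [
      "knowledge collapse and the long tail of knowledge",
      "large language models generate the center of the distribution",
      "the long tail rewards eccentric viewpoints",
  ]
  tokenized = [d.split() for d in docs]
  N = len(tokenized)

  # Document frequency: how many documents contain each term?
  df = Counter(term for doc in tokenized for term in set(doc))

  def tfidf(doc):
      # Terms frequent in this document but rare in the corpus get pumped up;
      # terms appearing everywhere (like "the") score zero.
      tf = Counter(doc)
      return {t: (c / len(doc)) * math.log(N / df[t]) for t, c in tf.items()}

  for vec in map(tfidf, tokenized):
      print(sorted(vec.items(), key=lambda kv: -kv[1])[:3])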


> Informally, we define knowledge collapse as the progressive narrowing over time (or over technological representations) of the set of information available to humans, along with a concomitant narrowing in the perceived availability and utility of different sets of information.

> The main focus of the model is whether individuals decide to invest in innovation or learning ... in the ‘traditional’ way, through a possibly cheaper AI-enabled process, or not at all. The idea is to capture, for example, the difference between someone who does extensive research in an archive rather than just relying on readily-available materials, or someone who takes the time to read a full book rather than reading a two-paragraph LLM-generated summary.

> Under these conditions, excessive reliance on AI-generated content over time leads to a curtailing of the eccentric and rare viewpoints that maintain a comprehensive vision of the world.

My intuition is that AI will just accelerate the trends that the internet brought on, which is that eccentric viewpoints are actually pretty common, even ones based on research and fact. The internet people mostly use has become relatively generic, consumed through a pretty narrow, curated aperture (social media). This feels analogous to getting it through AI, as described in the article. Yet people are still learning about eccentric, marginal stuff all the time, especially compared to, say, 50 years ago.

Assuming the AI's responses aren't artificially limited, people who are interested enough to look will still get to learn about topics in the long tail of the distribution, even in a world of ubiquitous AI. And they'll be able to dive as deeply into them as they do today. I'm not really worried about that.

If anything, the knowledge collapse will be at the center. Basic liberal education topics are what will go away. Or rather, they will be offloaded to AI. In the same way that people say they don't need to learn arithmetic because they have a calculator, my guess is people will be more likely to decide not to worry about what previous generations considered core knowledge: history, geography, the canon, and so on. "I don't have to know it, I can look it up". That'll all go away even faster than it's going now.

(I don't think this is a good thing, just stating the most realistic outcome based on extending what I've seen)


> decide not to worry about what previous generations considered core knowledge

These sorts of things have already been happening for centuries. What is considered core knowledge depends on the environment one lives in.

Way back in prehistoric days, foraging and identifying edible plants would be considered core knowledge. How many people do this today?

Prior to industrialization, most farmers would have had some experience doing blacksmithing (for their own tools, repairs, etc.). That would have been considered core knowledge, and it has been lost today because industrialization made it completely unnecessary. Ditto with sewing and many other skillsets that would be considered artisanal today.

If knowledge acquisition turns out to be like that too, then it's fine imho.


I feel like this has always been the case. The entire information economy is based on a few key publishers and figures. You see it in news, academia, social media -- there's orthodoxy everywhere. Not sure how AI is any different.


In the 1990s people read their town's newspaper. Now people in Arizona read the Daily Mail.


I worked for a local newspaper in Kansas around 2003/2004 and one thing I found surprising was that journalists there were frequently on the hook for writing up national stories - things that would come in off the wire services and then be re-written for the local audience.


Did they write it like What in tarnation is going on in the Big Apple?


In the 1990s, people who wanted to advertise in that town had to do so in local media. Now they can use ad tech, and the Daily Mail will arrange for it to be served.


It's probably different in that the few key publishers reached few people, but now AI is "sold" as being for everyone, and it's not unreasonable to expect it to become consumerized into a magic box of information, news, and problem solving in the near future.

It may not be a fundamentally new process, but what if it becomes so ubiquitous and fast that it becomes really harmful? And would there be a point of no return for society if this happened?


The key difference is incentives. I'm optimistic if AI remains a subscriber model and avoids an advertising model.


I don't expect they'll leave paid product placement off the table for good.


So by this definition, do we already have "knowledge collapse" by Wikipedia? Because if you search for a random concept, that's usually the first hit, and it's also what countless other sources draw on.


The same warning was given for Google. Except those people added that it would reduce problem solving ability, too. People would get used to whatever simple, instant content rose to the top. They'd gradually lose some or all of their ability to figure out the same things on their own. One submission here was a tech guy at a school saying that was already happening where he worked.


I mean, it does - people search stuff all the time now, rather than thinking about it.


IIRC, that was Socrates' complaint to Phaedrus about writing: that reading (because it was "high tech" at the time?) led only to an illusion of understanding.

Elsewhere Phaedrus echoes with a very modern complaint (even though search engines wouldn't arrive for another 2,300 years): "They would say in reply that he is a madman or a pedant who fancies that he is a physician because he has read something in a book, or has stumbled on a prescription or two, although he has no real understanding of the art of medicine."

https://www.gutenberg.org/files/1636/1636-h/1636-h.htm


Socrates wasn't wrong. Reading a lot gives you a partial understanding but it isn't complete without experiencing the thing for yourself. Arguably the Internet is the natural home of authoritatively stated but uninformed opinions - the exact result of reading a lot about a subject without having any experience of it.


Ironically we can only know he was right by making the exact mistake he was warning us against.


I think the Phaedrus is all about the importance of practice. I've been reading a lot of math books lately, but I don't actually grok anything well until I sit down and try to reason through the material myself, write my own little proofs, try to deconstruct what's being said actively with pen and paper. Similarly, I understand a work of literature far more deeply if I take active notes, and/or write a small essay about my interpretation. I become a better writer by reading good writers and emulating them in my own writing practice. Writing was a threat to poets when the goal was still to recite a compelling live performance, which, to do this well, would require memorization and practice—today, still, we ask that actors do not have paper scripts in front of them when performing in a film.

This is kind of the threat that tools like LLMs pose. Their power to generate decent results means that far more people will eschew practice for "good enough" LLM-produced results. Creation will become even more transactional, and (many) people will quickly fail to "see the point" in practicing, until we have a culture that's degraded even further than it already has today.


As the hallucinated Euclid said to Ptolemy, there is no LLM for geometry?


Personally, I think the problem is that people abuse Google for things it's really not designed to do, and they don't even realize that.

Google is great at finding official webpages by their exact title. If you type the title of a news headline from the 90s, Google will give you the link to it. I think that is amazing. Basically anything that has a canonical URL, Google is good at finding.

But when you search for "how to do X", for example, there will be several results that are perfectly valid, and they will still need to be ranked in a list. Because it's not really a "list" of results, it's a ranking by relevancy. So to avoid showing spam, Google will push to the top the websites it finds trustworthy. And now every top result comes from the same website. If you need an explanation of the xz incident, for example, there is no canonical URL for it. There will be several news websites, YouTube channels, etc. that have talked about it, competing to be the top result.

Google still has to rank them even though the algorithm can't tell fact apart from parody, so no matter what Google does, Google will be the one judging which content most people will read when they want to know about a certain topic.

To borrow my fellow robot's words, people are finding knowledge through an algorithmically curated aperture: Google's SERP.

If they're evil, they have the power to control everyone on Earth. If they're good, they must be going insane with what to do with their users' crippling dependency on them as a source of truth.


I find Google isn't so good anymore for finding things by title; rather than being a search engine they are slowly becoming more like a politician answering reporters' questions, in that instead of returning results based on the terms I asked for, they insist on returning results for the terms they believe I should have asked for.


Can you give me examples? Personally, I don't think I've ever had a problem finding things by their name or title on Google. If I type "python", I get Python's website; if I type "hacker news", I get this website. I can type the name of an old program and get its webpage on SourceForge, for example. I've also tested that it works for finding the official websites of celebrities, e.g. if you type "eminem", you get his official website.

It does seem to have biases toward recency and product pages, but finding things by their title isn't a thing Google is bad at.


I often search for titles containing very specific acronyms or abbreviations (which I would expect to be high precision low recall); Google's first response lately is usually a search based on a "correction" of these to a more common search, presumably because they are biasing to low precision high recall. (eg band names, even at hamming distance, will shadow maths structures)

Clicking again on my original query, or liberal use of quotation marks, then does the trick, but it's annoying af. Maybe there's a setting somewhere for "show me the results of my query first, not the 'did you mean' first"?

EDIT: Upon reflection, the problem is that I use Google in an early-twenty-first-century manner: if I've been someplace before I use the URL bar, and it's for places I haven't been, that aren't common knowledge, that I want a search engine. There was a time when Google was optimised for this use case, and although that time is long past (in particular, no one buys ads for my queries), the annoyance is recent because now they actively pessimise.


It's also gotten very poor at exact text matches. I often make a bet to myself before searching for certain things as to whether or not Google will show something irrelevant. Most recently I was searching for a specific memory address error crashing ASIO. I was using the exact text from the error popup, and Google was giving me results for... Skyrim Script Extender refusing to start. The only shared text was "Page fault" between the two. I've also noticed this when looking for specific part numbers where instead it will give me magazines by matching the ISSN. Forcing quotes is a dice roll, as sometimes that will just return zero results, or Google will do the exact opposite and broaden the search terms.


Yes, we do kind of.


And I distinctly remember this critique being made at Wikipedia's advent, as well.

And it is not without justification! https://undark.org/2021/08/12/wikipedia-has-a-language-probl...


Probably Wikipedia is an early precursor of this problem, yes.

It may not be entirely new, but AI might accelerate it to really harmful speeds.

Just pondering this, of course.


This just means that in-person critical thinking skills will be at an even higher premium than ever.

If knowledge collapse becomes evident, we'll dial back the use of AI, and a lot of "prompt monkey" businesses will go bankrupt.


> we'll dial back the use of AI

Who, and how? This sounds suspiciously like the invisible hand


If those using AI do not compete well economically with those who don't use it, then yes, it will happen, and I attribute this to the invisible hand of Darwinian competition.


I wonder how long natural providences like that hold as we get more and more alien to nature. We already learned we can f-up with global warming. We can fall off the cliff. It's not impossible. Maybe in the olden days when our boldest ideas were still relatively 'natural', we couldn't.


Yeah, that worked really well with cigarettes. People who used em were economically competitive with those who didn't.


LLMs are both language processing engines and knowledge bases. This article explores the knowledge base aspect of LLM and sheds light on the potential danger. The authors are well-justified in doing so because ChatGPT as a knowledge-bot is being used by many end users for its knowledge.

However, to my knowledge, many enterprise applications that are being built using LLMs feed task-specific curated knowledge to LLMs. This mode of LLM use is encouraging. I do not think this article acknowledged this aspect of LLM use.


I don't think this paper goes deep enough; they don't consider, for example, structured diversity seeding for LLMs, which entails prompting the model to combine diverse entities, knowledge, and skills selected at random from a list (a rough sketch of what I mean is at the end of this comment). Random combinations of conditioning context can lead the model to hyper-diversity. The Skill-Mix paper shows how this can work. https://arxiv.org/abs/2310.17567

LLMs have been trained on everything; when they are also prompted by diverse people with their own tasks and styles, that won't lead to knowledge collapse. Two diverse worlds meet each other in chat LLM interactions - the training corpus and the people.

But if you remove the human and just use fixed prompts, and then also narrow the training set with various methods, it might collapse.
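A rough sketch of the kind of diversity seeding I mean (the seed lists and wording are made up; the Skill-Mix paper has its own, more careful construction):

  import random

  # Hypothetical seed lists; a real system would draw from much larger pools.
  entities = ["a medieval falconer", "a submarine-cable engineer", "an Ottoman archivist"]
  skills = ["argue from first principles", "explain via a counterfactual", "use a cooking analogy"]
  topics = ["inflation", "soil erosion", "public-key cryptography"]

  def diversity_seeded_prompt(rng=random):
      # Randomly combining entities, skills and topics pushes the conditioning
      # context away from the "center" of the distribution.
      return (f"Writing as {rng.choice(entities)}, {rng.choice(skills)} "
              f"to explain {rng.choice(topics)}.")

  print(diversity_seeded_prompt())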


That's a good specific example/method, but the model in the paper specifically looks at when individuals will put effort into seeking this kind of diversity.


It sounds like the premise is that we will have fewer specialists and more dabblers. I think it could be true. LLMs are definitely a learning hack: never before could you ask an expert repeated dumb questions about a domain you know nothing about. Eventually you will learn something, but it is quite different from being a traditional student guided by an expert - learning things that you didn't think to ask questions about or initially were not interested in (but that later turn out to be important).


This would be a problem if knowledge were a closed system. In that case it would suffice to invoke the data processing inequality and sooner or later your knowledge is all gone and you’re left with just noise.

However, if AI enables new knowledge to be discovered then maybe there’s no problem in the end?


> I don’t disagree much with the paper’s premise, except that I am turned off whenever there is a handy new tool for us to use and then some people come up with reasons why using the new tool is a harmful thing to do.

I get turned off when people invent new tools.


AI is a clear negative from where I stand.

As of last week, there are two new AI-enabled cameras in my building that are connected directly to the police.

One of my banks recently started using a (completely useless) OpenAI ChatGPT bot, while lying to me that I am talking to a live agent.

When Googling something, I now skip the first page of results altogether, because it's all unending walls of AI-generated garbage.

I can't read Twitter, Reddit and many other sites now, because they want to sell the content to AI training companies themselves.

Up to 20% of my time on the web is now spent solving captchas, which are used to 1) help train AI and 2) protect against AI scrapers.

AI sends all my useful email to the junk folder, so I now have to check two folders instead of one.

Creative people are losing jobs right and left because AI can produce shitty "art" much cheaper. And this shitty "art" is now everywhere, polluting my brain.

AI uses an enormous amount of energy and contributes to the climate catastrophe.

AI is actively used to kill people in both currently ongoing wars.

Today's Guardian: China will use AI to disrupt elections in the US, South Korea and India, Microsoft warns.


Related GitHub repo:

- https://github.com/aristotle-tek/knowledge-collapse - "replication code for 'AI and the Problem of Knowledge Collapse'"


AI is the DaVinci enabler in a complex world


> While large language models are trained on vast amounts of diverse data, they naturally generate output towards the 'center' of the distribution

Yeah... Current-gen AI does seem impressive, but can't it be said that we have been preparing ourselves to be easily impressed by these "normal" results? The world is more homogenized than ever, making it much easier for a bland intelligence to sound intelligent.


The discussion section is quite illuminating.

"While much recent attention has been on the problem of LLMs misleadingly presenting fiction as fact (hallucination), this may be less of an issue than the problem of representativeness across a distribution of possible responses. Hallucination of verifiable, concrete facts is often easy to correct for. Yet many real world questions do not have well-defined, verifiably true and false answers. If a user asks, for example, “What causes inflation?” and a LLM answers “monetary policy”, the problem isn’t one of hallucination, but of the failure to reflect the full-distribution of possible answers to the question, or at least provide an overview of the main schools of economic thought."


First thought: Oh no, they want LLMs to be even more vocal about nuance

Second thought: People aren't going to read nuance

Third thought: They should

Fourth thought: Have you met people? They'll get angry with you for even suggesting it


Right. It all hooks up to a naive, popular philosophy of knowledge that I think has been burgeoning since the birth of search engines: people today tend to think that everything is decomposable into atomic facts and that all questions have concrete, singular, atomic answers.

Obviously, this could not be further from the truth, but I think that ever since the advent of things like "answer cards" in search, people have been led into the technological bias trap of thinking that whatever answer the engine spits out must be the canonical answer, forgetting that we still have a broad range of questions and research areas we don't have anything close to definitive answers on. The problem is only exacerbated when people make this mistake in domains that will never have singular answers, like politics and ethics.



