If by “understand” you mean “can model reasonably accurately much of the time” then maybe you’ll find consensus. But that’s not a universal definition of “understand”.
For example, if I asked you whether you “understand” ballistic flight, and you produced a table that you interpolate from instead of a quadratic, then I would not feel that you understand it, even though you can kinda sorta model it.
And even if you do, if you didn’t produce the universal gravitation formula, I would still wonder how “deeply” you understand. So it’s not like “understand” is a binary I suppose.
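For concreteness, here is a minimal rendering of the quadratic and the universal gravitation formula being referenced, in the idealized, drag-free case:

```latex
% Height of a projectile over time (the quadratic), with the constant g itself
% obtained from Newton's universal gravitation (G, mass M, distance r from the center):
y(t) = y_0 + v_0\,t - \tfrac{1}{2}\,g\,t^{2},
\qquad
g = \frac{G\,M}{r^{2}} \approx 9.8\ \mathrm{m/s^{2}}
```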
Well what would you need to see to prove understanding? That's the metric here. Both the LLM and the human brain are black boxes. But we claim the human brain understands things while the LLM does not.
Thus, what output would you expect from either of these boxes to demonstrate true understanding, in answer to your question?
It is interesting that you are demanding a metric here, as yours appears to be like duck typing: in effect, if it quacks like a human...
Defining "understanding" is difficult (epistemology struggles with the apparently simpler task of defining knowledge), but if I saw a dialogue between two LLMs figuring out something about the external world that they did not initially have much to say about, I would find that pretty convincing.
This is a common misunderstanding, one also seen with regard to definitions. When applied to knowledge acquisition, it suffers from a fairly obvious bootstrapping problem, which goes away when you realize that metrics and definitions are rewritten and refined as our knowledge increases. Just look at what has happened to concepts of matter and energy over the last century or so.
You are free to disagree with this, but I feel your metric for understanding resembles the Turing test, while the sort of thing I have proposed here, which involves AIs interacting with each other, is a refinement that makes a step away from defining understanding and intelligence as being just whatever human judges recognize as such (it still depends on human judgement, but I think one could analyze the sort of dialogue I am envisioning more objectively than in a Turing test.)
No, it's not a misunderstanding. Without a concrete definition of a metric, comparisons are impossible, because everything is based on wishy-washy conjectures about vague and fuzzy concepts. Hard metrics bring in quantitative data. They show hard differences.
Even if the metric is some side marker that is later found to have poor correlation with, or causal connection to, the thing being measured, the hard metric is still valid.
Take IQ. We assume IQ measures intelligence. But in the future we may determine that, no, it doesn't measure intelligence well. That doesn't change the fact that IQ tests still measured something. The score still says something definitive.
My test is similar to the Turing test. But so is yours. In the end there's a human in the loop making a judgment call.
This is rather self-contradictory: you insist we can't make progress with wishy-washy conjectures on vague and fuzzy concepts, and yet your entire argument in this thread for your claim that machine understanding of the real world has been achieved is based on exactly that: your personal subjective assessment of LLM performance!
In your final paragraph, you attempt to suggest that my proposed test is no better than the Turing test (and therefore no better than what you are doing), but as you have not addressed the ways in which my proposal differs from the Turing test, I regard this as merely waffling on the issue. In practice, it is not so easy to come up with tests for whether a human understands an issue (as opposed to having merely committed a bunch of related propositions to memory) and I am trying to capture the ways in which we can make that call.
You entered this debate saying "I think we are way past the point of debate here. LLMs are not stochastic parrots. LLMs do understand an aspect of reality", yet your post here ends with "in the end there's a human in the loop making a judgment call", explicitly acknowledging that your strong initial claims are matters of opinion, rather than established facts supported by hard metrics.
>This is rather self-contradictory: you insist we can't make progress with wishy-washy conjectures on vague and fuzzy concepts, and yet your entire argument in this thread for your claim that machine understanding of the real world has been achieved is based on exactly that: your personal subjective assessment of LLM performance!
No, it's not. I based my argument on a concrete metric: human behavior, human input and output.
> I regard this as merely waffling on the issue.
No offense intended, but I disagree. There is a difference, but that difference is trivial to me. Two LLMs talking is also unpredictable. LLMs aren't machines directed specifically to generate creative ideas; they only do so when prompted. An LLM left to its own devices, generating random text, does not necessarily produce new ideas. You need to funnel it in the right direction.
>You entered this debate saying "I think we are way past the point of debate here. LLMs are not stochastic parrots. LLMs do understand an aspect of reality", yet your post here ends with "in the end there's a human in the loop making a judgment call", explicitly acknowledging that your strong initial claims are matters of opinion, rather than established facts supported by hard metrics.
There are thousands of quantitative metrics. LLMs perform especially well on these. Do I refer to one specifically? No. I refer to them all collectively.
I also think you misunderstood. Your idea is about judging whether an idea is creative or not. That's too wishy-washy. My idea is to compare the output to human output and see if there is a recognizable difference. The second idea can easily be put into an experimental quantitative metric in the exact same way the Turing test does it. In fact, like you said, it's basically just a Turing test.
Overall, AI has passed the Turing test, but people are unsatisfied. Basically, they need a harsher Turing test to be convinced. For example, have people know up front that the thing inside the computer might be an LLM rather than a person, and have them investigate directly to uncover its true identity. If the LLM can consistently deceive the human, then that is literally the final bar for me.
What are these "thousands of quantitative metrics" on which you base your latest claims? If you have had them on hand all this while, it seems odd that you have not made use of them so far.
>What are these "thousands of quantitative metrics" on which you base your latest claims? If you have had them on hand all this while, it seems odd that you have not made use of them so far.
Hey, no offense, but I don't appreciate this style of commenting where you say it's "odd." I'm not trying to hide evidence from you, and I'm not intentionally lying or making things up in order to win an argument here. I thought of this as an amicable debate. Next time, if you just ask for the metric rather than say it's "odd" that I don't present it, that would be more appreciated.
I didn't present evidence because I thought it was obvious. How are LLMs compared with one another in terms of performance? Usually with quantitative tests. You can feed them any number of these tests, including the SAT, the bar exam, the ACT, IQ tests, the SAT II, etc.
Most of these tests aren't enough, though, as the LLM is remarkably close to human behavior and can perform comparably to, and even better than, most humans. That last statement would usually make you think those tests are enough, but they aren't, because humans can still detect whether or not the thing is an LLM over a longer, targeted conversation.
The final run is really to give the human, with full knowledge of the task, a full hour of investigating an LLM to decide whether it's a human or a robot. If the LLM can deceive the human, that is a hard True/False quantitative metric. That's really the only type of quantitative test left where there is a detectable difference.
I had no intention of implying any malfeasance in my use of the word "odd"; I mean it in the sense of unusual, unexpected and surprising. The thing is, you finished your precursor post saying, about your tests and mine, that it comes down to there being a human in the loop making a judgement call, but in a follow-on you say that there are thousands of quantitative metrics. Why, I wondered, would that matter, if it comes down to a human making a judgement call? Were you switching to a different line of argument, one that (as far as I could tell) had not been raised before? That's what I found surprising about your claim.
I am still rather confused about how this fits into what you are saying more generally. At first I thought you were saying, in your latest post, that the Turing-test interrogator should be restricted to asking questions from the sets having quantitative metrics in order for it to be an objective process, but that doesn't really hold up, as far as I can see. Frankly, I suspect that the tests with objective metrics are beside the point, and the essence of your position is contained within your final paragraph: "If the LLM can deceive the human [then] that is a hard True/False quantitative metric [and the only sort we can get]."
If so, then (no surprise) I think there are some problems with it, but before I go further, I would like to check that I understand your position.
>I had no intention of implying any malfeasance in my use of the word "odd"; I mean it in the sense of unusual, unexpected and surprising. The thing is, you finished your precursor post saying, about your tests and mine, that it comes down to there being a human in the loop making a judgement call, but in a follow-on you say that there are thousands of quantitative metrics. Why, I wondered, would that matter, if it comes down to a human making a judgement call? Were you switching to a different line of argument, one that (as far as I could tell) had not been raised before? That's what I found surprising about your claim.
It matters because of humans. If I gave an LLM thousands of quantitative tests and it passed them all, but in an hour-long conversation a human could identify it as an LLM through some flaw, the human would consider all those tests useless. That's why it matters. The human making a judgement call is still a quantitative measurement, btw, as you can limit human output to True or False. But because every human is different, in order to get good numbers you have to do measurements with multitudes of humans.
>I am still rather confused about how this fits into what you are saying more generally. At first I thought you were saying, in your latest post, that the Turing-test interrogator should be restricted to asking questions from the sets having quantitative metrics in order for it to be an objective process, but that doesn't really hold up, as far as I can see.
It can still be objective with a human in the loop, assuming the human is honest. What's not objective is a human offering an opinion in the form of a paragraph with no definitive clarity on what constitutes a metric. I realize that elements of MY metric have indeterminism to them, but it is still a hard metric because the output is over a well-defined set. Whenever you have indeterminism, you turn to probability and many samples in order to produce a final quantitative result.
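As a minimal sketch of what "probability and many samples" could look like in practice (the function name and the numbers here are mine, purely illustrative): aggregate the judges' True/False verdicts over many sessions and test the hit rate against chance.

```python
# Illustrative only: aggregate binary judge verdicts ("was it an LLM?") over many
# sessions and check whether accuracy differs detectably from the 50% chance level.
import math

def chance_test(correct_calls: int, total_trials: int) -> tuple[float, float]:
    """Return (accuracy, approximate two-sided p-value against p = 0.5)."""
    accuracy = correct_calls / total_trials
    se = math.sqrt(0.25 / total_trials)          # std. error under the null p = 0.5
    z = (accuracy - 0.5) / se
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return accuracy, p_value

# Hypothetical numbers: 500 judging sessions, 260 correct identifications.
acc, p = chance_test(260, 500)
print(f"accuracy={acc:.3f}, p={p:.3f}")  # a large p means no detectable difference
```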
>If so, then (no surprise) I think there are some problems with it, but before I go further, I would like to check that I understand your position.
Yes, my position is exactly that. If all observable qualities indicate it's a duck, then there's nothing more you can determine beyond that, scientifically speaking. You're implying there is a better way?
At this point, I think it is worth refreshing what the issue here is, which is whether LLMs understand that the language they receive is about an external world, which operates through causes which have nothing to do with token-combination statistics of the language itself.
> It matters because of humans...
I'm still a bit puzzled here, because it seems to me that the paragraph continuing from here is making the argument that LLM performance on these tests doesn't matter, as far as the question is concerned: in this paragraph you seem to be saying (paraphrased) that despite LLMs' impressive performance on these quantitative tests, they could still fail Turing tests, so their performance on these quantitative tests is not decisive.
> yes my position is that exactly…
The impression I get from what you have written in this post is that you are not claiming that a test conforming to your requirements has actually been successfully performed, you are just assuming it could be?
Regardless, let’s assume (at least for the sake of argument) that the series of tests you propose have been performed, and the results are in: in the test environment, humans can’t distinguish current LLMs from humans any better than by chance. How do you get from that to answering the question we are actually interested in? The experiment does not explicitly address it. You might want to say something like “The Turing test has shown that the machines are as intelligent as humans so, like humans, these machines must realize that the language they receive is about an external world” but even the antecedent of that sentence is an interpretation that goes beyond what would have objectively been demonstrated by the Turing test, and the consequent is a subjective opinion that would not be entailed by the antecedent even if it were unassailable. Do you have a way to go from a successful Turing test to answering the question here, which meets your own quantitative and objective standards?
>I'm still a bit puzzled here, because it seems to me that the paragraph continuing from here is making the argument that LLM performance on these tests doesn't matter, as far as the question is concerned: in this paragraph you seem to be saying (paraphrased) that despite LLMs' impressive performance on these quantitative tests, they could still fail Turing tests, so their performance on these quantitative tests is not decisive.
It matters in the quantitative sense. It measures AI performance. What it won't do is matter to YOU. Because you're a human, and humans will keep moving the bar to a higher standard, right? When AI shot past the Turing test, humans just moved the goalposts. So to convince someone like YOU we have to look at the final metric: the point where LLM I/O becomes indistinguishable from, or superior to, human I/O. Of course, if you look at the last decade... AI is rapidly approaching that final bar.
>The impression I get from what you have written in this post is that you are not claiming that a test conforming to your requirements has actually been successfully performed, you are just assuming it could be?
Whether I assume it or not, the projection of the current trendline indicates that it will. Given the trendline, that is the most probable conclusion.
>The experiment does not explicitly address it.
Nothing on the face of the earth can address the question. Because nobody truly knows what "understanding" something actually is. You can't even articulate the definition in a formal way such that it could be implemented as a computer program.
So I went to the next best possibility, which is my point. The point is ALTHOUGH we don't know what understanding is, we ALL assume humans understand things. So we set that as a bar metric. Anything indistinguishable from a human must understand things. Anything that appears close to a human but is not quite human must understand things ALMOST as well as a human.
> What it won't do is matter to YOU. Because you're a human, and humans will keep moving the bar to a higher standard, right? When AI shot past the Turing test, humans just moved the goalposts. So to convince someone like YOU we have to look at the final metric.
It is disappointing to see you descending into something of a rant here. If you knew me better, you would know that I spend more time debating in opposition to people who think they can prove that AGI/artificial consciousness is impossible than I do with people who think it is already an undeniable fact that it has already been achieved (though this discussion is shifting the balance towards the middle, if only briefly.) Just because I approach arguments in either direction with a degree of skepticism and I don't see any value in trying to call the arrival of true AGI at the very first moment it occurs, it does not mean that I'm trying (whether secretly or openly) to deny that it is possible either in the near-term or at all. FWIW, I regard the former as possible and the latter highly probable, so long as we don't self-destruct first.
> Nothing on the face of the earth can address the question. Because nobody truly knows what "understanding" something actually is. You can't even articulate the definition in a formal way such that it could be implemented as a computer program.
The anti-AI folk I mentioned above would willingly embrace this position! They would say that it shows that human-like intelligence and consciousness lies outside of the scope of the physical sciences, and that this creates the possibility of a type of p-zombie that is indistinguishable by physical science from a human and yet lacks any concept of itself as an entity within an external world.
More relevantly, your response here repeats an earlier fallacy. In practice, concepts and their definitions are revised, tightened, remixed and refined as we inquire into them and gain knowledge. I know you don't agree, but as this is not an opinion but an empirical observation, validated by many cases in the history of science and science-like disciplines, I don't see you prevailing here - and there's the knowledge-bootstrap problem if this were not the case, as well.
It occurred to me this morning that there's a variant or extension of the quantitative Turing test which goes like this:
We have two agents and a judge. The judge is a human, and the agents are either a pair of humans, a pair of AIs, or one of each, chosen randomly and without the judge knowing the mix. One of the agents is picked, by random choice, to start a discussion with the other with the intent of exploring what the other understands about some topic, with the discussion-starter being given the freedom to choose the topic. The discussion proceeds for a reasonable length of time - let's say one hour.
The judge follows the discussion but does not participate in it. At the conclusion of the discussion, the judge is required to say, for each agent, whether it is more likely that it is a human or AI, and the accuracy of this call is used to assign a categorical variable to the result, just as in the version of the Turing test you have described.
This seems just as quantitative, and in the same way, as your version, yet there's no reason to believe it will necessarily yield the same results. More tests are better, so what's not to like?
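To make the bookkeeping concrete, here is a rough sketch (all names invented, not part of the proposal itself) of the record each session would produce and how the judge's per-agent calls turn into a score:

```python
# Rough sketch of the proposed variant: two agents per session, a silent judge,
# and a per-agent "human"/"ai" verdict that gets tallied into an accuracy score.
import random
from dataclasses import dataclass

@dataclass
class SessionResult:
    agent_kinds: tuple[str, str]   # ground truth for each agent: "human" or "ai"
    judge_calls: tuple[str, str]   # judge's verdict for each agent

def score(sessions: list[SessionResult]) -> float:
    """Fraction of per-agent labels the judge got right."""
    correct = sum(kind == call
                  for s in sessions
                  for kind, call in zip(s.agent_kinds, s.judge_calls))
    return correct / (2 * len(sessions))

# Toy run: a judge guessing at random over a handful of sessions.
sessions = [
    SessionResult(
        agent_kinds=tuple(random.choice(["human", "ai"]) for _ in range(2)),
        judge_calls=tuple(random.choice(["human", "ai"]) for _ in range(2)),
    )
    for _ in range(5)
]
print(score(sessions))  # roughly 0.5 expected for random guessing
```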
>It is disappointing to see you descending into something of a rant here.
I'm going to be frank with you: I'm not ranting, and uncharitable comments like this aren't appreciated. I'm going to respond to your reply later in another post, but if I see more stuff like this I'll stop communicating with you. Please don't say stuff like that.
I could have, equally reasonably, made exactly the same response to your post. I will do my best to respond civilly (I admit that I have some failings in this regard), but I also suggest that whenever you feel the urge to capitalize the word "you", you give it a second thought.
Apologies; by YOU I mean YOU as a human, not YOU as an individual. We all generally feel that the quantitative tests aren't enough. The capitalization was for emphasis, for you to look at yourself, know that you're human, and recognize that you likely feel the same thing. Most people would say that stuff like IQ tests isn't enough, and we can't pinpoint definitively why; as humans, WE (keyword change) just feel that way.
That feeling is what sets the bar. There's no rhyme or reason behind it. But humans are the one who make the judgement call so that's what it has to be.
For your test, I don't see it offering anything new. I see it as the same as my test, just with extra complexity. From a statistical point of view, I feel it will yield roughly the same results as my test, as long as the judge outputs a binary true or false on whether the entities are humans or AIs.
Yes, I did say we can't define understanding. But despite the fact that we can't define it, we still counterintuitively "know" when something has the capability of understanding. We say all humans have the capability of understanding.
This is the point. The word is undefined yet we can still apply the word and use the word and "know" whether something can understand things.
Thus we classify humans as capable of understanding things without any rhyme or reason. This is fine. But if you take this logic further, that means anything that is indistinguishable from a human must fit into this category.
That was my point. This is the logical limit of how far we can go with an undefined word. To be consistent in our logical application of the word "understanding", we must apply it to AI if AI is indistinguishable from humans. If we don't do this, then our reasoning is inconsistent. All of this can be done without even having a definition of the word "understanding".
I think it may be helpful for me to say some more about how I came to my current positions.
Firstly, there have been a number of attempts to teach language to other animals, and also a persistent speculation that the complex vocalizations of bottlenose dolphins are a language. There is no consensus, however, on what to make of the results of these investigations, with different people offering widely disparate views as to the extent to which these animals have, or have acquired, language.
My take on these studies is that their language abilities are very limited at best, because they don't seem to grasp the power of language. They rarely initiate conversations, especially outside of a testing environment, and the conversations they do have are perfunctory. In the case of dolphins, if they had a well-developed language of their own, it seems unlikely that those being studied would fail to recognize that the humans they interact with have language themselves, and fail to cooperate with the humans' attempts to establish communication, as this would have considerable benefit, such as being able to negotiate with the humans who exercise considerable control over their lives.
From these considerations, it seems to me that unless and until we see animals initiating meaningful conversations, especially between themselves without human prompting, it is pretty clear that their language skills do not match those of adult humans. This is what led me to see the value of a form of Turing test in which the test subjects demonstrate that they can initiate and sustain conversations.
A second consideration is that while human brains and minds are largely black boxes, we know a great deal about LLMs: humans designed them, they work as designed, and while they are not entirely deterministic, their stochastic aspect does not make their operation puzzling. We also know what they gain from their training: it is statistical information about token combinations in human language as it is actually used in the wild. It is not obvious that, from this, any entity could deduce that these token sequences often represent an external world that operates according to causes which are independent of what is said about the situation. An LLM is like a brain in a vat which only receives information in the form of a string of abstract tokens, without anything else to correlate it with, and it is incapable of interacting with the world to see how it responds.
From these considerations, therefore, it seems possible that, if LLMs understand anything, it is at most the structure of language as it is spoken or written, without being aware of an external world. I can't prove that this is so, but for the purpose of the arguments in this thread, and specifically the one in the first post that you replied to, all I need is that it is not ruled out.
Turning now to your latest post:
> For your test I don't see it offering anything new.
It is far from obvious that it will necessarily produce the same results as your test, and you have presented no argument that it will. If we are in the situation where one of these tests can discriminate between the candidate AIs and humans, then the only rational conclusion is that these candidate AIs can be distinguished from humans, even if the other test fails to do so.
> From a statistical point of view I feel it will yield roughly the same results as my test.
Throughout these conversations with me and other people, you have insisted that only quantitative tests are rigorous enough, but now you are arguing from nothing more than your opinion as to what the outcome would be. An opinion about what the quantitative results might be is not itself a quantitative result, and while you might be comfortable with the inconsistency of your position here, you can't expect the rest of us to agree.
> But despite the fact that we can't define [understanding] we still counterintuitively "know" when something has the capability of understanding. We say all humans have the capability of understanding... the word is undefined yet we can still apply the word and use the word and "know" whether something can understand things.
Good! This is a complete reversal from when you were arguing that understanding was not a valid concern unless it were rigorously defined.
> Thus we classify humans as capable of understanding things without any rhyme or reason. [my emphasis.]
If it were truly without rhyme or reason, 'understanding' would be an incoherent concept - a misconception or illusion. Fortunately, there is a rigorous way for handling this sort of thing: we can run a series of Turing-like tests, or simply one-on-one conversations, but only with human subjects, with multiple interrogators examining the same set of people and judging the extent to which they understand various test concepts. The degree of correlation between the outcomes will show us how coherent a concept it is, and the transcripts of the tests can be examined to begin the iterative process of defining what it is about the candidates that allows the judges to produce correlated judgements.
Once we have that in place, we can start adding AIs to the mix, confident that they are being judged by the same criteria as humans.
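A sketch of how the "degree of correlation between the outcomes" might be computed (invented data and names; one of many possible agreement measures): each judge scores each human subject's understanding of a topic, and we look at the average pairwise correlation between judges.

```python
# Illustrative sketch: average pairwise Pearson correlation between judges'
# understanding ratings of the same human subjects. High agreement suggests
# the judges are tracking a coherent property; low agreement suggests noise.
from itertools import combinations

def pearson(xs: list[float], ys: list[float]) -> float:
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def mean_agreement(ratings: dict[str, list[float]]) -> float:
    """ratings: judge name -> scores for the same ordered list of subjects."""
    pairs = list(combinations(ratings.values(), 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)

# Made-up ratings (1-5 scale) from three judges over five subjects.
ratings = {
    "judge_a": [4, 2, 5, 1, 3],
    "judge_b": [5, 2, 4, 1, 3],
    "judge_c": [4, 3, 5, 2, 2],
}
print(f"mean pairwise correlation: {mean_agreement(ratings):.2f}")
```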
> But if you take this logic further, that means anything that is indistinguishable from a human must fit into this category.
Certainly not if the test is incapable of finding the distinction. The process I outlined above would be able to make the distinction, unless 'understanding' is not a coherent concept (but we seem to agree that it probably is.) Furthermore, as I pointed out above, one test capable of consistently making a distinction is all it takes.
The author of the post is saying that understanding something can't be defined because we can't even know how the human brain works. It is a black box.
The author is saying that, at best, you can only set benchmark comparisons. We just assume all humans have the capability of understanding without ever really defining the meaning of understanding. And if a machine can mimic human behavior, then it must also understand.
That is literally how far we can go from a logical standpoint. It's the furthest we can go in terms of classifying things as capable of understanding, not capable, or close to it.
What you're not seeing is that the LLM is not only mimicking human output to a high degree; it can even produce output that is superior to what humans can produce.
What the author of the post actually said - and I am quoting, to make it clear that I'm not putting my spin on someone else's opinion - was "There's no difference between doing something that works without understanding and doing the exact same thing with understanding."
I'm the author. To be clear, I referred to myself as "the author."
And no, I did not say that. Let me be clear: I did not say that there is "no difference". I said that whether there is or isn't a difference, we can't fully know, because we can't define or know what "understanding" is. At best we can only observe external reactions to input.
That was just about guaranteed to cause confusion, as in my reply to solarhexes, I had explicitly picked out "the author of the post to which you are replying", who is cultureswitch, not you, and that post most definitely did make the claim that "there's no difference between doing something that works without understanding and doing the exact same thing with understanding."
It does not seem that cultureswitch is an alias you are using, but even if it is, the above is unambiguously the claim I am referring to here, and no other.
I think there are two axes: reason about and intuit. I "understand" ballistic flight when I can calculate a solution that puts an artillery round on target. I also "understand" ballistic flight when I make a free throw with a basketball.
On writing that, I have an instinct to revise it to move the locus of understanding in the first example to the people who calculated the ballistic tables, based on physics first-principles. That would be more accurate, but my mistake highlights something interesting: an artillery officer / spotter simultaneously uses both. Is theirs a "deeper" / "truer" understanding? I don't think it is. I don't know what I think that means, for humans or AI.