Maybe I misunderstood your comment? Also I may have been unclear in my reply.
I understood your specific Greek dictionary example as being something that does not encode any meaning, because the words are not real Greek. However, completely hypothetically, if "Αβαρκαμαξας" were a word that a whole bunch of Greek people started repeatedly using in the same or similar contexts, it would quite literally become a real word.
That is what I mean when I say meaning is entirely statistical. Words and phrases emerge, disappear, and change meaning over time. These changes often happen differently within different segments of populations that speak the same language. What words mean depends entirely on enough people agreeing to use the words the same way. LLMs capture the commonly used modern meanings of words extremely accurately.
edit: That's also how ChatGPT was able to understand the commonly made typo in the question I asked it. Statistically, I probably meant to say "can you tell me what the word [...] means".
What I meant with the example I gave was that you can't figure out what a word means just by looking at what other words it's close to, unless you already know what all those words mean. That is relevant to my comment above where I wrote:
>> Language models don't encode meaning. They "encode" the probability of collocations between lexical tokens. Such collocations are correlated with meaning, but they do not "encode" meaning.
So I gave an example of some words that I suggested were collocated, to demonstrate that you can't tell what they mean just by looking at them next to each other.
That the words don't really exist is something that you wouldn't know unless I had told you, or unless you had access to a dictionary. Or, I guess, a Greek speaker. I told you that the words are not real Greek not to make a point, but to avoid being disingenuous [1].
The point of all this jumping around is that if you don't know what the words mean, you won't know what the words mean, no matter how many statistics you take on a corpus, no matter how large.
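To make that concrete, here's a toy sketch (the corpus is invented, and this has nothing to do with how any particular model is trained) of what collocation statistics actually give you: counts of which tokens keep company with which, and nothing about what any of them denote.

    # A toy sketch, not how any real model works: count which tokens
    # co-occur with the made-up word in a small, invented "corpus".
    from collections import Counter
    from itertools import combinations

    corpus = [
        "η αβαρκαμαξας ειναι μαυρα φουμαρα".split(),
        "αυτη η αβαρκαμαξας ηταν μαυρα φουμαρα".split(),
        "η αβαρκαμαξας ειναι παλια".split(),
    ]

    # Count how often each pair of distinct tokens shares a sentence.
    pair_counts = Counter()
    for sentence in corpus:
        for a, b in combinations(sorted(set(sentence)), 2):
            pair_counts[(a, b)] += 1

    # List the pairs containing the made-up word, most frequent first.
    target = "αβαρκαμαξας"
    for pair, count in sorted(pair_counts.items(), key=lambda kv: -kv[1]):
        if target in pair:
            print(pair, count)

All this tells you is that "αβαρκαμαξας" keeps turning up next to "μαυρα" and "φουμαρα". It doesn't tell you whether it's a cart, a curse, or a typo; for that you still need a speaker, or a dictionary written by one.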
This is actually related to your point. I don't think that "language is statistical". I do agree that language, and the meaning of words and sentences, changes constantly. But that's one more reason why language models are such a poor, well, model of language: ten years from now, there will be many expressions used in everyday language that ChatGPT will not be able to recognise, because they will not have been in its training corpus. What kind of model of language is one that's frozen in time and can't keep up with the vicissitudes of language as it evolves among its speakers?
A very poor model of language, indeed.
______________
[1] You put the words in ChatGPT and it told you they mean nothing. That's not entirely true. There is a pair of words in the explanation part of my sentence, "μαύρα φούμαρα", which are real Greek words and form a meaningful Greek phrase. So I was a little bit disingenuous, or, depending on how you see it, not quite: I said that the "explanation" is not real Greek, not that every word in it is made up. In my defense, I was sure you'd put the phrase in Google Translate or, if I told you it means nothing, into ChatGPT. So I wanted to double-blind this a bit. Apologies for leading you down the garden path. But, look at all the flowers!
In any case, ChatGPT didn't recognise these two words and the phrase they form and gave you a generic answer, riffing off the information you gave it that the sentence was a "dictionary entry". So you told it it's a dictionary entry and it said it's a dictionary entry and it doesn't know what it means. That doesn't mean that it knows that it doesn't know what it means. It just says the most likely thing, which just happens to be "it's all Greek to me", basically. It could just as well have made up some meaning, because meaning means nothing to it and all the words are the same to it, except for their location in its model.
Come to think of it, I suspect the prompt you gave ChatGPT, and particularly the bit about "it's OK if you don't know", strongly biases it towards saying "I don't know" rather than pretending it knows. Can you try again, without that part?
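If it helps, something like this would make the comparison easy to repeat (assuming the OpenAI Python client; the model name and the dictionary-entry text are just stand-ins, not the exact ones you used):

    # Rough A/B of the prompt, with and without the escape hatch.
    # Requires the openai package and OPENAI_API_KEY in the environment.
    from openai import OpenAI

    client = OpenAI()

    # Hypothetical stand-in for the dictionary entry from the earlier comment.
    entry = "Αβαρκαμαξας: τα μαύρα φούμαρα"

    prompts = {
        "with escape hatch": (
            "Can you tell me what the word in this dictionary entry means? "
            "It's OK if you don't know. " + entry
        ),
        "without escape hatch": (
            "Can you tell me what the word in this dictionary entry means? " + entry
        ),
    }

    for label, prompt in prompts.items():
        response = client.chat.completions.create(
            model="gpt-4o",  # example model name
            messages=[{"role": "user", "content": prompt}],
        )
        print(label)
        print(response.choices[0].message.content)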