> To test for possible contamination, I tried the same prompts without attaching the sample translations and Claude failed and refused to answer, saying that it is unfamiliar with the Circassian language.
This doesn't indicate that Claude is unfamiliar with Circassian, only that Circassian is sufficiently rare that refusing to answer is a plausible response.
The language is not that obscure in the grand scheme of things: there's a Wikipedia article explaining the grammar, https://en.wikipedia.org/wiki/Kabardian_grammar, which is definitely in Claude's training set, probably alongside a few hundred linguistics papers and a bunch of monolingual data.
If you measured the performance for different numbers of initial translation examples, I suspect there would be a sudden jump at the point where Claude stops refusing to even try, and that after that, additional examples would only marginally improve the output.
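To make that concrete, here's roughly the measurement I have in mind, as a minimal sketch rather than anything from the post: `call_model` is a placeholder for whatever LLM API you're using, the example and test pairs stand in for the post's actual translation data, and chrF (via sacrebleu) is just one reasonable way to score the output.

```python
# Sketch: vary the number of few-shot example translations and score the
# output at each step, looking for the point where the model stops refusing
# and quality jumps.
# Assumptions: call_model is a placeholder for whatever LLM API you use;
# EXAMPLE_PAIRS and TEST_PAIRS stand in for real (source, reference) data.
from sacrebleu.metrics import CHRF

EXAMPLE_PAIRS = [("source sentence 1", "reference translation 1"),
                 ("source sentence 2", "reference translation 2")]  # ... more pairs
TEST_PAIRS = [("held-out source sentence", "held-out reference translation")]

def call_model(prompt: str) -> str:
    """Placeholder for an actual LLM API call."""
    raise NotImplementedError

def build_prompt(examples, source):
    shots = "\n".join(f"Source: {src}\nTranslation: {tgt}" for src, tgt in examples)
    return f"{shots}\nSource: {source}\nTranslation:"

chrf = CHRF()
for k in [0, 1, 2, 5, 10, 20, 50]:
    hyps, refs = [], []
    for source, reference in TEST_PAIRS:
        hyps.append(call_model(build_prompt(EXAMPLE_PAIRS[:k], source)))
        refs.append(reference)
    print(f"{k} examples -> chrF {chrf.corpus_score(hyps, [refs]).score:.1f}")
```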
If all those resources (Wikipedia article and papers) exist, then they are surely in GPT-4's training set as well. So clearly there is a difference between the capabilities of Claude and GPT-4.
If this were correct, it would be a bit less impressive than OP's claim, but still a monumental leap forward for translation of low-resource languages - a task at which all previous LLMs have failed.
I need to show this to the people claiming "it's just statistical inference, these models can't demonstrate any '''understanding'''", as if understanding were proven to be something else. These people assume that every single bit of intelligence these models show is somehow already in their training set, which is patently false and very easy to test.
During the first week ChatGPT was public, more than a year ago, I had the early model play along with me in inventing a new programming language with novel syntactic features. After some back and forth, it could translate my JavaScript samples into the new language while respecting the new language's semantics, and it could even simulate running simple pieces of code. Sure, it made some token errors here and there, but it worked. It understood what I was telling it and responded in kind.
Over the months since, things have only gotten better. So I'm not surprised by the results of this post, but I'm still astonished all the same.
> These people assume that every single bit of intelligence these models show is somehow already in their training set, which is patently false and very easy to test.
Interesting, please create the very easy test that proves that.
One was already provided in my comment: ask it to create something from an idea that exists only in your imagination, set up the constraints in natural language, and guide it to explore the dynamics within the context and constraints you just provided. You can see before your very eyes that it can work with a context and a set of constraints you pulled out of thin air, something that exists nowhere but its context window.
I mean, even being able to find the salient points of an article written today (so not in the training set) and summarize it is intelligence, but some would move the goalposts and dismiss that as too simple, and from your tone I assumed you would do the same. So I provided a hopefully more foolproof way of testing for emergent intelligence that extends beyond the training data.
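For what it's worth, that kind of test can even be made mechanically checkable. The sketch below is my own toy version, not anything from the thread: it defines an arbitrary rewriting rule that exists only in the prompt, demonstrates it a few times, and checks the model's answer on a fresh input against a programmatically computed gold answer, so a correct response can't simply be retrieved from training data. `call_model` is again a placeholder for whatever LLM API you use.

```python
# Toy version of the "constraints pulled out of thin air" test: define a
# made-up transformation that exists only in this prompt, demonstrate it,
# and check whether the model can apply it to a new input.
# Assumption: call_model is a placeholder for whatever LLM API you use.

def call_model(prompt: str) -> str:
    """Placeholder for an actual LLM API call."""
    raise NotImplementedError

def invented_rule(word: str) -> str:
    """Arbitrary rule invented for this test: reverse the word, then double every vowel."""
    reversed_word = word[::-1]
    return "".join(c * 2 if c in "aeiou" else c for c in reversed_word)

demos = ["translate", "language", "window"]
target = "context"

prompt = (
    "I just invented a transformation on words. Here are some examples:\n"
    + "\n".join(f"{w} -> {invented_rule(w)}" for w in demos)
    + f"\nApply the same transformation to: {target}\n"
    "Answer with the transformed word only."
)

answer = call_model(prompt).strip()
print("model:", answer, "| expected:", invented_rule(target),
      "| correct:", answer == invented_rule(target))
```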