I would like to see this expanded; I think it's a bit unfair to assess its abilities with so few examples. My hypothesis is that a Rosetta Stone with a thousand examples, hooked up to a vector database so you don't hit the 32k token context limit, would lead to much better performance.
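To make the idea concrete, here is a minimal sketch of what I have in mind: rank the stored examples by similarity to the sentence being translated, then pack the best ones into the prompt until a token budget is used up. Everything here is an assumption for illustration. `embed` is a toy character-trigram stand-in for a real embedding model, `rough_tokens` just counts whitespace-separated words rather than real tokens, and the `kleti` fields are placeholders, not real Kłeti.

```python
import math
from collections import Counter

def embed(text, n=3):
    """Toy stand-in for an embedding model: character n-gram counts."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rough_tokens(text):
    """Crude token estimate (whitespace words); a real tokenizer would differ."""
    return len(text.split())

def select_examples(query, examples, budget):
    """Pick the most similar examples that still fit in the token budget."""
    q = embed(query)
    ranked = sorted(examples, key=lambda ex: cosine(q, embed(ex["english"])),
                    reverse=True)
    chosen, used = [], 0
    for ex in ranked:
        cost = rough_tokens(ex["english"]) + rough_tokens(ex["kleti"])
        if used + cost > budget:
            continue  # too big for the remaining budget; try a smaller one
        chosen.append(ex)
        used += cost
    return chosen

# Placeholder corpus: the "kleti" values are NOT real Kłeti sentences.
EXAMPLES = [
    {"english": "the dog eats bread", "kleti": "<kleti-1>"},
    {"english": "the cat sleeps", "kleti": "<kleti-2>"},
    {"english": "I will go to the market tomorrow", "kleti": "<kleti-3>"},
]

picked = select_examples("the dog eats fish", EXAMPLES, budget=6)
```

With a real embedding model and a real tokenizer the structure would stay the same; only `embed` and `rough_tokens` would be swapped out.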
We'd love to see that too! However, I'm afraid that creating a substantial number of examples would transform this delightful family activity into something akin to punishment. Kłeti is quite the challenge for us Indo-Europeans, and it seems that even its creator isn't immune to the struggle.
Both GPT-3.5 and GPT-4 versions of ChatGPT are limited to 4k tokens, even though GPT-4 is capable of 32k.
This leads me to believe that some of the mediocre results OP saw were partly due to hitting the token limit, at which point ChatGPT started "forgetting" earlier parts of the conversation.
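I don't know how ChatGPT actually manages its window, but the kind of "forgetting" I mean can be sketched as a client that keeps only the most recent messages that fit in a fixed budget. The function name `trim_history` and the word-count token estimate are my own assumptions, not anything from OpenAI.

```python
def rough_tokens(text):
    """Crude token estimate (whitespace words); a real tokenizer would differ."""
    return len(text.split())

def trim_history(messages, limit):
    """Keep the newest messages whose combined size fits within the limit."""
    kept, used = [], 0
    for msg in reversed(messages):      # walk from newest to oldest
        cost = rough_tokens(msg)
        if used + cost > limit:
            break                       # everything older is dropped
        kept.append(msg)
        used += cost
    return list(reversed(kept))         # restore chronological order

history = [
    "grammar rule one two three",       # early instructions
    "example a b",
    "please translate this",
]
visible = trim_history(history, limit=6)
```

Under this scheme the early grammar rules are the first thing to fall out of the window, which is exactly the failure mode I'm guessing at.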
No, I was explicitly watching for this. In one of the sessions where we asked it to generate Kłeti sentences and the conversation passed the token limit, it started inserting characters like ı (the Turkish dotless i). A week earlier I was playing with interpreting Go positions, and at some point the model switched to talking about chess (a bit less subtle than inserting unusual characters).
GPT-4 allows you to use 8k of context in their current beta, if you're using the chat API directly. It will be interesting (and probably expensive, lol) when they open it to a full 32k.
I'm really looking forward to being able to use a personalized LoRA on top of a GPT-4+ class model. I want to be able to train on all of my writing over the past few decades and interrogate the history of my ideas, and I think this would be tremendously valuable for writers of all kinds. Heck, think of the value of training (with their blessing) on something like /r/AskHistorians, or other deep-dive, high-quality fora.
The vector database would be good for retrieving vocabulary, but could it be expected to do things like retrieve sentences with similar syntax or tenses? It feels like it would be hard to successfully retrieve examples that were important for reasons other than semantic content.
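One way this might be worked around, sketched under heavy assumptions: if someone annotated each example with grammatical tags, retrieval could rank on tag overlap instead of (or alongside) semantic similarity. The tag names and the `rank_by_grammar` helper below are hypothetical, and the annotation work itself is the hard part.

```python
def feature_overlap(a, b):
    """Jaccard overlap between two sets of grammatical tags."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def rank_by_grammar(query_tags, examples):
    """Order examples by how many grammatical features they share with the query."""
    return sorted(examples,
                  key=lambda ex: feature_overlap(query_tags, ex["tags"]),
                  reverse=True)

# Hypothetical hand-annotated examples; the tags are illustrative only.
TAGGED = [
    {"english": "the dog eats bread", "tags": {"present", "transitive"}},
    {"english": "the cat slept", "tags": {"past", "intransitive"}},
    {"english": "she will read the book", "tags": {"future", "transitive"}},
]

best = rank_by_grammar({"past", "intransitive"}, TAGGED)[0]
```

Here the past-tense intransitive sentence ranks first even though it shares no vocabulary with whatever is being translated, which is the behavior plain semantic search would struggle to give you.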