This game is well known in the UK as the "Connecting Wall" from Only Connect.
This result - poor ChatGPT performance - surprises me. I thought pattern detection and set forming were something ChatGPT could do well. Perhaps it would need a model specifically trained for this task. If AlphaZero can master chess, then surely this game isn't beyond what is trainable.
You can prompt ChatGPT that it'll be playing the connecting wall without having to explain the game. It still fails to make a good set of connections when provided the wall.
One interesting feature of the "Connecting Wall" sets is that there is almost always a "wordy" one involving changing a letter, adding a prefix, anagrams, etc. There's also almost always a "person" one - for example, a set of "famous people named Tom..." - but never, say, a set of "Toms" alongside a set of "Margarets". The remaining couple of sets are more general.
This is a huge help given the 2 minutes and 30 seconds provided.
On another note, it's possible that the GCHQ puzzle book would be in the training set, which has many puzzles with solutions in this format and a very similar rubric with 55 items and sets of sizes 1 through 10. That said, ChatGPT perhaps would not tie the answers in the back of the book to the puzzles in the front.
All in all, I think an AI trained for this purpose with problems and given solutions ought to end up mastering this format. But a general-purpose ChatGPT seems to perform very badly.
> This result - poor ChatGPT performance - surprises me. I thought pattern detection and set forming were something ChatGPT could do well
I would speculate it’s struggling because of the linear nature of its output, and the red-herring words which cross over between categories.
Because the model can’t “look ahead”, it starts spitting out valid combinations, but without being able to anticipate that committing to a certain combination early on will lead to a mistake later.
I expect if you asked it to correct its output in a followup message, it could do so without much difficulty.
> I expect if you asked it to correct its output in a followup message, it could do so without much difficulty.
I had a similar idea to the author and tried this many times, albeit with the free version of ChatGPT. After getting wrong results, I prompted it to correct them, even telling the model explicitly that a category is wrong or doesn't make sense. Nothing I did made a difference.
My two cents on why this doesn't work: the answer must contain a discrete set of words given in the prompt, and importantly, none may be duplicated. I suspect that these current models are not very good at following the instruction "this token should appear in the answer exactly once".
> Because the model can’t “look ahead”, it starts spitting out valid combinations, but without being able to anticipate that committing to a certain combination early on will lead to a mistake later.
Aren't there already models that CAN look ahead? Or are there none?
Can we infer anything about what LLMs can achieve from what we can achieve with AIs like AlphaGo? I thought their approaches were completely separate.
GPTs are a class of text predictors. Ultimately they are ranked on whether the output is similar to the training data, text-wise. If the training data included a game, then the model may be able to play that game, but only if that game requires reasoning about entire words (because of tokenization, GPTs can't reason in terms of letters - that's why they do poorly at crosswords, for example).
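To make the tokenization point concrete, here's a quick demo using the tiktoken library (assuming the cl100k_base encoding used by GPT-4-era models); the exact splits are an assumption, but the principle holds for any BPE tokenizer:

```
# Quick illustration of why GPTs struggle with letter-level reasoning:
# the model sees token ids, not characters. Requires `pip install tiktoken`.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4-era encoding
for word in ["anagram", "homophone", "connections"]:
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]
    print(word, "->", pieces)
# A word may be one opaque token or a few multi-letter chunks;
# either way, individual letters are never directly visible to the model.
```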
On the flip side, AlphaZero is a class of networks that have a list of actions they can take and a list of parameters they observe about the game (in chess: the board position; in other games: position on screen, score, speed, etc.). The model is then trained to take actions that maximize an actual hard value from the game, like winning a game of chess, capturing a piece, increasing a score, or driving the furthest.
In theory you could train a model with the AlphaGo method to do text prediction, but LLMs are called "large" for a reason: the input and output spaces would have to be the number of possible tokens (and at that point, just train a normal GPT - it's much more efficient). Also in theory you could train a GPT to play games, but you'd be spending huge amounts of compute evaluating extraneous words in the input (the prompt) and the output (most words have nothing to do with your game). On top of that, you're iterating over every word you generate to produce the next one, so you're doing multiple passes of this largely inefficient computation, which makes you slower than a tailor-made model that can evaluate a situation once and give you a list of actions to perform.
In this specific case it's a bit weird, because the input space for the AlphaZero-style model would have to be every word that can appear on the board, but the reasoning part is most likely not a problem given enough model size. Since it's competing with a multi-gigabyte LLM, though, there is space to spare.
I've certainly thought about testing LLMs on Connections and I'm glad someone has. It might be possible to increase their performance, but LLMs as-is are not suited for the task.
The problem is that Connections is ultimately a search problem that requires more than simply grouping similar words. There are lots of combinations to assess. I bet if you enumerate, score, then rank all possible groupings, an LLM would perform much better.
Doesn't this list the words in the order that they are grouped? The article states that randomizing the words completely eliminates any successful results.
Apart from the "it just explained the already-ordered groups in the question" problem, it didn't even explain one of the groups correctly. "Something about coat(ing) and food" is not the correct explanation; it's missing the lateral logic step from food-related to a separate meaning.
This is pretty interesting. Intuitively, Connections is the kind of thing I would expect GPT to not be good at, because almost every day there's something that feels kind of "out of left field" in the categories. In my experience LLMs are good at regurgitating the "standard" take on a topic, or "best practices", but lack the creativity and out-of-the-box thinking that makes Connections fun.
On the other hand, it feels like the kind of thing where an LLM might be surprisingly good, because it could, in theory, be able to see more correlations than a human can. Based on these results I guess my intuition seems to hold up.
I wonder if a better / different way to approach this could be more "algorithmically" - maybe have the LLM generate a list of possible categories for each individual word and then try to operate on those associations?
The "whole point" of embeddings is that words have a vector that represents how well that word fits into a certain categories, so words belonging together is close in that vector space. So in that sense it almost feels like this should be solvable using something simpler than a full LLM. To "just" get the embeddings of the words, and then find the groups of 4 that minimizes the total distances within the groups.
The problem is that Connections is designed to use a ton of alternate definitions and other ambiguities that aren’t well modeled in typical embeddings. Today’s, for instance (spoilers!!), has Coat, Green, Pod, and Soup linked as matching “Pea ___”. No embedding would relate them at all, unless that suffix pattern is known a priori.
I unfortunately can’t imagine having time to test this, but there may be a way to accomplish it with embeddings.
The game itself is sort of an embeddings clustering problem, with the added difficulty that each group needs to only be alike in 1 way (versus a full vector distance which measures how alike they are in every way).
Maybe there is some way to search for a vector of weights which, when multiplied by all members of a group of 4, produces weighted vectors with the least distance from their center? And then it’s an optimization problem to find the 4 groups that minimize the total distance from each group's center.
It may be possible to find a weight vector that selects for a particular slice of a word's meaning.
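One cheap stand-in for that weight-vector search: a nonnegative weight vector minimizing weighted within-group variance would pile its mass onto the dimensions where the four words already agree, so you can approximate it by scoring a group on its k lowest-variance dimensions only. A sketch of that shortcut, not the full optimization (k is an assumed tuning knob):

```
# Sketch of the "alike in only one way" idea: score a candidate group of 4
# by how tightly it clusters on its best-agreeing embedding dimensions,
# rather than across the whole vector.
import numpy as np

def sliced_group_score(group_vecs, k=10):
    """group_vecs: array of shape (4, dim). Lower score = tighter group."""
    per_dim_var = np.var(group_vecs, axis=0)  # spread on each dimension
    return np.sort(per_dim_var)[:k].mean()    # keep only the k tightest dims
```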
That approach works well for a game like [Codenames](https://en.wikipedia.org/wiki/Codenames_(board_game)), where you're trying to find a single-word hint common to many of your words (that doesn't hit any of the other words).
My feeling is that it'll struggle with the word-plays in Only Connect/Connections (like missing letters, added letters, words-within-words, homophones, etc.) as well as two-step references (such as {Venice, Dream, Night, Nothing} => "last words of Shakespeare plays").
I thought it would. But I've spent a fair bit of effort both using embeddings and also using prompts to GPT4, as well as combinations of the two approaches, to try to make a good spymaster for Codenames with essentially zero success.
I was playing a bit with embeddings in 2021. I'd played Codenames online with friends in lockdown, and we often had interesting boards we'd talk about, so when I saw papers like this (https://arxiv.org/abs/2105.05885) I looked into the topic. I found the suggested clues were very good, and there were some 'clue scoring' functions which correlated with the actual best spymasters. It wasn't as scientifically rigorous as OP's post, but I would say it was good.
I tried clustering similar embeddings, but it did extremely poorly (~0%), since the groupings are often deceiving: words in a group share only one narrow connection, and there are lots of spurious fake groups to throw you off. Maybe looking for groups with high similarity on only a subset of embedding dimensions might help, but I didn't have much time to play either :) A notebook to get you going if you do want to play: https://colab.research.google.com/drive/1KJeSB9Q5XzSeT9ONUJ_...
You need to model how a person actually plays Connections. Start with the most obvious group, the one with the least ambiguity; then your problem space is smaller for category 2, and the same again for categories 3 and 4.
So really you could fine tune 3 models - one for 16 words, one for 12, and one for 8. Then use them in succession.
Also, if you come across a mistake at the end (have some negative examples in the training sets), tell it to start over and add to the prompt what you think is NOT a group.
Connections is deliberately written so that any one word might belong to multiple groups. For example, the word "Bass" might be surrounded by "Guitar," "Drums," and "Microphone," but actually belongs to a category of "Fish," while "Guitar" might belong to the category "Air ___," and "Microphone" might belong to "Something that can be dropped."
Just making up that example, but it's very common that multiple words will all appear to be one group, and actually each one belongs to a different group.
FWIW, on my first attempt at 0-shot prompting I was able to get about 20% accuracy (perfect 4/4 groups), with ~50% of groups correct on average and most mistakes being groups with 3/4 right (so at least on the right track). The prompt goes something like this:
```
You are playing [game info] with [word list]
Follow these steps:
- consider possible groupings as initial brainstorming (>4 groups)
- propose a first hypothesis based on the likely-looking groups
- reflect on whether that grouping works
- revise if needed then submit the final predictions
```
Having it start with word one of group one as the first output token seems unlikely to work, from my intuition about what these models can do. Heck, I can't solve it that way! Burning some tokens on exploration and hypothesis-building leaves it with the easier task of choosing plausible groups from the proposed options. System 2 thinking vs. System 1, perhaps.
I've played around with the same problem, though I didn't do any fine-tuning. Some strategies that seemed promising:
- A two-pass approach where you prompt it to generate the groupings, then separately prompt it to find the words that belong into each group. (Which of the following words best fit the category "COMMON DOG NAMES"?). It does way better at the more specific queries.
- Don't tell it the constraints of 4 groups of 4 words; ask it for at least four groups of 2 or more words. Once you have 4+ groups of 4+ words, you can make logical inferences with your Python wrapper to come up with all the possible 4x4 groupings. If you're lucky there will only be one. If not... more queries to GPT, I guess, but I haven't figured this part out.
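The "logical inferences" step in that second point can be a small exact-cover search: expand each candidate category (4+ member words) into every 4-word subset, then look for four disjoint subsets that cover the board. A sketch, with `candidates` assumed to be a dict mapping category names to the word lists the LLM returned:

```
# Sketch: given loose candidate categories from the LLM (name -> 4+ words),
# find every way to pick one 4-word subset from each of four categories
# such that the subsets are disjoint and cover the 16-word board exactly.
from itertools import combinations

def consistent_solutions(candidates, board):
    options = [(name, frozenset(sub))
               for name, members in candidates.items()
               for sub in combinations(members, 4)]
    solutions = []

    def extend(start, chosen, used):
        if len(chosen) == 4:
            if used == frozenset(board):
                solutions.append(chosen)
            return
        for i in range(start, len(options)):
            name, group = options[i]
            if group.isdisjoint(used) and name not in dict(chosen):
                extend(i + 1, chosen + [(name, group)], used | group)

    extend(0, [], frozenset())
    return solutions  # ideally exactly one; if more, it's back to GPT
```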
1. Have it do a thinking/brainstorming phase first to try to work out what the potential categories are.
2. Then ask it to scan over each word and think about what categories it could go in, in order of likelihood.
3. Ask it to do the final answer.
Format the training set in that way, as if it got everything right at each step (since you only have the right answers).
It sounds like you had 7 * 30 = 210 examples, call it around 200. Maybe you can feed in a batch of ten at a time, explain the game, and try to get GPT-4 to generate more examples. You will have to check whether they make sense.
I assume that by increasing the size of the dataset by a factor of ten, and having the LLM think through the problem using multiple steps, you will get significantly better results.
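If you did build a fine-tuning set in that brainstorm-then-scan-then-answer shape, each record could look roughly like this (a sketch: the field names are made up, and the intermediate steps would be written from the known answers, as if the model reasoned correctly at each stage):

```
# Sketch of one fine-tuning record in the brainstorm -> scan -> answer shape.
import json

record = {
    "prompt": "Words: BASS, GUITAR, DRUMS, ... Group them into 4 sets of 4.",
    "completion": (
        "Brainstorm: possible categories include fish, instruments, 'air ___'...\n"
        "Scan: BASS -> fish (likely) or instrument (trap); GUITAR -> 'air ___'...\n"
        "Answer: FISH: bass, ...; AIR ___: guitar, ...; ..."
    ),
}
print(json.dumps(record))
```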
I spent about 30 minutes with GPT4 and tried lots of variations of pre-processing. I had it first list large numbers of possible categories, then try to consider one category with four words, then double-check that category and look at the remaining words, then go ahead with the next....
No matter how I instructed it to think, it frequently could not work out the very first category.
I spent a bunch of time manually using GPT-4 with fairly simple prompts, giving it the same feedback that the game gives. There's an archive of puzzles which I used to try to train it; sometimes it was very successful, and sometimes it was frustrating how bad it was at basic things like keeping track of which words it had already used. Each day I would also have it play the new puzzle from the NYTimes, which it couldn't have trained on. Some days it did perfectly; some days it made really stupid mistakes. It seems like a more concerted effort could achieve better results.
It's a shame you can't just see the probability distributions over the 16 words and choose them yourself; that way you'd never hallucinate a word, and the groups would always be 4 words long.
This is a great idea, actually. This way you could also enforce that all the words appear _exactly once_ (by rejecting, during sampling, words that were already in the answer), which seemed to be a significant issue for me when I tried this.
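The hosted APIs don't expose the full next-token distribution, but with a local model you could implement exactly this rejection scheme. A sketch, where `word_logprob(context, word)` is a hypothetical helper (with a local model you'd sum the token log-probs of the word given the context):

```
# Sketch of constrained decoding for the grid: at each slot, score only the
# puzzle words that haven't been used yet and greedily pick the best one.
def solve_constrained(words, word_logprob):
    context, grid = "", []
    remaining = set(words)
    for _ in range(16):
        best = max(remaining, key=lambda w: word_logprob(context, w))
        grid.append(best)
        remaining.remove(best)  # guarantees each word appears exactly once
        context += best + ("\n" if len(grid) % 4 == 0 else ", ")
    return [grid[i:i + 4] for i in range(0, 16, 4)]
```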
I suspect the inability of the model to "plan ahead" is a significant contributor to its poor performance relative to a human. Being able to check a grouping to be sure it includes at least four words _and_ to check that it doesn't conflict with the other three groupings is a major advantage - it's pretty common that these puzzles include partial or incompatible red herring groups.
If this is the case, performance might be improved by taking the final solving responsibility away from the model and giving it to the script. You could ask GPT for categories, ask whether each word fits each category (discarding categories with fewer than 4 words), and then search for 4 non-overlapping categories.
(This might be missing the point of the exercise though.)
Would step-wise instruction help with the look-ahead issue? Something like:
1. Here are 16 words. Find 4 that have something in common, and list the remaining 12 words.
2. Take the remaining 12 words from the previous answer and find 4 words that have something in common, and list the remaining 8 words.
etc. etc.
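That loop is easy to drive from a wrapper script. A sketch, where `ask_llm` is a placeholder for whatever chat call you're using and `check_group` stands in for the game's yes/no feedback, with wrong guesses fed back as known non-groups (as suggested upthread):

```
# Sketch of the step-wise loop: peel off one group at a time, shrinking the
# board, and accumulate known-wrong guesses in the prompt.
def stepwise_solve(words, ask_llm, check_group, max_mistakes=4):
    remaining, solved, non_groups = list(words), [], []
    while remaining and len(non_groups) < max_mistakes:
        prompt = (
            f"Words: {', '.join(remaining)}\n"
            f"Known NOT to be groups: {non_groups}\n"
            "Name 4 of these words that share a connection."
        )
        guess = ask_llm(prompt)  # expected: a list of 4 words
        if check_group(guess):
            solved.append(guess)
            remaining = [w for w in remaining if w not in guess]
        else:
            non_groups.append(guess)  # feed the mistake back next round
    return solved
```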
fun. good writeup. i tried setting up a custom gpt to run a game like anagramish.com - gave it the word list to choose from, instructions on the rules, etc - but no matter what i did in the prompt it would hallucinate start words or it would incorrectly accept invalid guesses (or mark correct guesses as invalid).
This game looks cool but wow the UX is terrible. Why can't you click & drag the words to reorder them? Seems half the difficulty is keeping track of your thought process with the inability to make a draft state.
Also, there is no reason to limit the number of guesses. Just let me try until I figure it out. But no, they've put a limit so that it can be "sharable" in a tweet-sized text, to try to copy the virality of Wordle. And they do it to the detriment of the gameplay, in such a way that I don't even bother playing.
Disagreed; the limit is what gives the game a constraint and makes it interesting, IMHO. I like having something that can make me fail: I care less about optimizing a score and more about beating it in the first place. Different people play games differently, etc.
I also don't see how it makes it "sharable". Wouldn't it be more sharable if they let everyone win and just give them a score?
No, it scores you based on the number of moves used. No need for an upper bound; it could've let me use 20 guesses if that's what it takes (non-native speaker). But that wouldn't fit their copy&paste result formatting.
There is a UI for the game without the upper bound at connections.swellgarfo.com. Personally that annoys me as my friends are prone to sharing massive walls of incorrect guesses when they do badly on that site, but it sounds like it would be a good fit for you.
Too many people would take a scatter-shot approach to solving, even with a score that tracks guesses, and then those same people would get bored and disillusioned with the game.
The limited number of attempts is precisely what makes it a game.
The format in the show it’s lifted from (Only Connect - greatest game show ever) is that the teams have 2 minutes total to solve the “connecting wall”. They can have as many guesses as they want until they solve the first two groups - after that it’s 3 strikes and you’re out.
>Why can't you click & drag the words to reorder them?
That level of difficulty is part of the game. There is a shuffle button to ease idea generation, but most likely it was done like that by design.