A small weakness in this test is that one of the keys to strategic Codenames play is understanding your partner. You're not just trying to connect the words, you're trying to connect them in a way that will be obvious to your partner. As a computing analogy: you're trying to serialize a few cards in a way that will be deserializable by the other player.
This test pairs o1 with itself, which means the serializer is the deserializer. So while it's impressive that it can link 4 words, most humans could also easily link 4 with as much stretching! We just don't tend to because we can't guarantee that the other human will make the same connections we did.
lol I played this game with my family and they said my wife and I were cheating because I kept using inside jokes that made no sense to them but she would get immediately.
That's a big part of what makes this game enjoyable - a clue that is very obvious to one person might not even cross the mind of someone else. To anyone reading this who hasn't played, it's definitely worth giving it a try.
Agreed, big fan of codenames in general but it plays its best when you’re playing against / alongside people that you’ve known for a while. The metagaming aspect of structuring clues to who your partner is really takes it to the next level.
Stretching? Never! I see your 4-clue, o1, and raise you “QUEUE” for 5:
- Line (Standing in the queue…)
- London (they’re all queued up, innit?)
- Log (*backend distsys handwaving*)
- Mail (what do you think an inbox is, anyway?!)
- Round (homophone “Q” is a typographically round letter)
Thanks for the comment. I actually tried explicitly mentioning in the prompt that 'Your guesser follows the same reasoning process', but this did not make any clear improvement. Maybe I should've done more prompt engineering.
Nah, prompt engineering wouldn't have solved the fundamental issue, which is that the associations between ideas as stored in the weights will be the same between the two AI players, which makes it an easier game for them than for a human equivalent. It'd be like two copies of you playing on a team, having shared all the same experiences right up until the moment the game starts.
And don't get me wrong, it's still a fun experiment! It's just that that 4 would never have worked if a human played against another human—there are simply too many other words that would be equally strongly associated:
* Gum: Gum is often wrapped in paper, so 'GUM' is strongly associated with the word 'PAPER'.
* King: King is a type of face card, which are printed on paper, so 'KING' is strongly associated with the word 'PAPER'. (Repeat for JACK.)
* Light: Paper is a lightweight material.
That's 4 others right there that are at least as closely connected in my head as LAWYER or LOG. The only reason why o1 pulled up the same four when guessing as it did when clueing is that it's the same model.
Again, I didn't mean this as a knock, just a warning about drawing too many conclusions from the test!
That we disagree on this is exactly why who you're playing with matters. I'd have never gotten to lawyer, certainly wouldn't have connected log. Line is a very faint possibility. Mail is the only one I'd have gotten for sure.
Ehhh I don’t think that’s accurate. The problem is not linking 4 words. It’s linking 4 words without accidentally triggering other, semantically adjacent words.
This task could probably be solved nearly as well with old-school word2vec embeddings.
Right, that's what I meant to be getting at: when you connect 4 words with as much stretching as o1 did there, you're running a real risk that the other party connects a different set. Unless that other party is also you and has the same learned connections at top of mind.
I'm confidently relaying my experience. But I get that I was extremely terse and overly general in my reply.
I haven't surveyed all the papers, although I have read some. And all the ones that I've seen that work okay -- do so by using a language graph or word association graph in their algorithm. Not just embeddings. Even then the results don't look good to me compared to human performance.
Why does it sound crazy that it wouldn't work well? Have you used word embeddings much? Maybe you have and have good reason to think this - I don't mean to imply otherwise. But it doesn't sound crazy to me that it wouldn't work well.
I don’t find this “super good”. It’s mostly giving 2 clues which is the most basic level of competence. The paper 4 clue is reasonable but a bit lucky (eg Jack is also a good guess). I also don’t see it actually using tactics properly, which I would consider part of being “super good”. The game isn’t just about picking a good clue each round!
Now obviously it’s still pretty decent at finding the clues. Probably better than a random human who hasn’t played much. Just I find the post’s level of hype overstated. It feels like the author isn’t very experienced with Codenames.
It would be interesting to compare AI:human vs human:human games to see which does better. It seems like AI:AI will overstate its success.
Can you elaborate on some of the more advanced tactics?
When I play, it's mostly about getting a good 2 clue each time. Then if you can opportunistically get a 3 or 4, that's awesome.
Some tactics come in for choosing the right pairs of 2's so you don't end up mismatched, or leaving clues that might be ambiguous with your opponent's... But that's mostly it.
It'll be fun for multiplayer! Just like how in other online games you can add in an AI to play as one of the players.
If you really want to get good, your goal is not so much to get as many tiles as possible, but rather to get the tiles that are semantically distinct from your opponent’s. A single mistake that triggers your opponent’s tile is generally enough to lose the game. And even if they don’t do it, having them uncover the tiles from their side that are semantically similar to your own team is also useful.
If you want to get nasty, you learn to abuse the fact that the tile layouts follow rules and that you can rule out certain tiles without considering the words.
Memorizing the tile layouts is too much for me haha (imo against the spirit of the game). I usually play online now anyway so I hope they don't follow those same patterns as the physical version.
There are 40 setup cards with 4 possible rotations that specify agent placements, so it's theoretically possible to do some kind of memorization.
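For a sense of scale, here is a rough sketch comparing the setup-card deck with fully randomized boards. The 40 cards and 4 rotations come from the comment above; the 5x5 grid with a 9/8/1 split (the rest bystanders) is my assumption about the standard game, and the function names are made up for illustration. It also assumes every rotation gives a distinct layout, so treat the first number as an upper bound.

```python
from math import comb

SETUP_CARDS = 40  # from the comment above
ROTATIONS = 4     # from the comment above


def fixed_layouts(cards=SETUP_CARDS, rotations=ROTATIONS):
    """Upper bound on layouts reachable from the physical setup cards."""
    return cards * rotations


def random_layouts(grid=25, first_team=9, second_team=8, assassins=1):
    """Count fully random layouts (assumed standard 5x5 board).

    Pick cells for the first team, then the second team, then the
    assassin; whatever is left over becomes bystanders.
    """
    remaining = grid
    total = 1
    for group in (first_team, second_team, assassins):
        total *= comb(remaining, group)
        remaining -= group
    return total


if __name__ == "__main__":
    print("setup-card layouts:", fixed_layouts())
    print("fully random layouts:", random_layouts())
```

The gap between the two counts is why a small fixed deck is memorizable in principle while randomized boards are not.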
Personally I'd find that kind of play style very unfun, and would rather switch to fully randomized boards if I played enough that it became a problem.
Other advanced tactics involve giving a broad clue that matches 3-4 of your own and just one other (either your opponent's or a civilian). Your team can pick up all the matches across several turns, and the one-off doesn't hurt as much as the plus four helps.
The S-tier tactic: when a high-number clue is cut short by a turn-ending mistake, the guessers tell their clue giver to inflate the number on the next, totally unrelated clue. The inflation equals however many cards remained from the truncated turn that the guessers can already locate without additional information (so it would be wasteful for a future clue to re-group them). The stated number of that next clue must therefore cover its own cards plus the prior cards.
Example: The clue is "places 4" and the guessers choose 1 correctly and then 1 wrong answer, but they had achieved consensus about 2 others (and are confused about only the remaining 1). So the turn ends, but they inform the clue giver to inflate by 2 next turn. That clue giver (after the other team goes) will then say the clue is "people 5" and the guessers will know that they shall select 2 places and 3 people.
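The arithmetic in that example can be sketched in a few lines. This is only an illustration of the bookkeeping; the function and parameter names are hypothetical, not terms from the rule book.

```python
def carried_over(clue_number, correct_guesses, still_unclear):
    """Cards from a truncated clue that the guessers already agree on."""
    leftover = clue_number - correct_guesses - still_unclear
    if leftover < 0:
        raise ValueError("more cards accounted for than the clue covered")
    return leftover


def inflated_number(new_targets, carried):
    """Number the clue giver states: cards for the new clue plus carried-over cards."""
    return new_targets + carried


if __name__ == "__main__":
    # "places 4": 1 correct guess, 1 card still unclear, so 2 carry over.
    carried = carried_over(clue_number=4, correct_guesses=1, still_unclear=1)
    # The next clue really covers 3 people, so it is stated as "people 5".
    print(inflated_number(new_targets=3, carried=carried))
```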
I don't think this sort of communication from guessers to clue giver is in the spirit of the game (at least in my play group). However, inflating later clues is a reasonable approach! It's just that I don't think you're allowed to communicate the amount of inflation. Guessers must determine whether people 5 has slack to allow additional guesses on previous clues.
You're free to add additional prohibitions on communication as a house rule I guess, but the only prohibition in the rule book I've seen is that the clue giver's speech must consist exclusively of clues (and private consultation with the other clue giver). The clue giver is free to adjust their clue in reaction to anything they hear, and guessers can speak freely.
Important: the clue giver cannot acknowledge the instruction during gameplay. That would certainly extend beyond giving a clue! The guessers must know that their clue giver can play this way prior to the game commencing.
Edit: I just consulted the rules and this is the most relevant section:
> If you are a field operative, you should focus on the table when you are making your guesses. Do not make eye contact with the spymaster while you are guessing. This will help you avoid nonverbal cues.
> When your information is strictly limited to what can be conveyed with one word and one number, you are playing in the spirit of the game.
The author's use of the pronoun "you/your" switches from field ops in that first paragraph to spymasters in that second paragraph, confusingly. With that in mind, it boils down to this: field ops cannot seek non-clue information from spymasters, and spymasters cannot convey non-clue information. The strategy I'm suggesting involves neither!
If you take this idea of communication restrictions to the limit, you could imagine the guessers identifying N sets of cards by a single word each as they discuss their guess. The clue giver listens, then uses the clue that identifies the correct set of N cards.
You really just need an algorithm that generates unique sets of 8 or 9 from the whole board and identifies each set by a word.
Yeah it's interesting to take these ideas to the extreme... even at the lower end I don't like it, I think zero communication outside of clues is the best way to follow the spirit of the game. But a little bit of banter and "kibitzing" is what makes it fun too.
I played in a Codenames tournament at CGE's stand at GenCon, and they forbade guessers from communicating at all. Officially, it's supposed to be just the clue and number and nothing else.
The communication is only necessary/important if people haven't set this as a convention in the first place. I'll say this prior to ever looking at my clues: "I will give you higher numbers than what I said if you miss by more than 1. The number I pick will always be high enough to allow you, with the +1 guess you get for free, to make guesses on all the words I was hinting at."
There's also all kinds of not necessarily intended communication from the guessers in the fact that you can listen to which words they were considering and didn't pick. Nothing in the game attempts to say that you should not consider, say, whether they were going in the right or wrong direction in their guessing, but it sure can make a difference in how to approach later clues. If they were being very wrong, there might be a need to double up on words that you intended, and that your guessers missed.
In the same fashion, nothing in the game says that I cannot listen to those guesses as a member of the other team, whether guesser or spymaster, and then change behaviors to make sure we don't hit words they considered as candidate words without very good reasons. Let them double dip on mistakes, or not make their difficult decisions easier. It's not as if the game demands that everyone who isn't currently guessing should wear headphones to be sure they disregard what the other team says or does.
You can of course play however you want (and I certainly think this is clever), but imo this is likely against the spirit, and perhaps letter, of the rules.
The rule on giving clues is:
"If you are the spymaster, you are trying to think of a one-word clue that relates to some of the words your team is trying to guess. When you think you have a good clue, you say it. You also say one number, which tells your teammates how many codenames are related to your clue." (emphasis mine).
The rule states that the number should be the number of words related to the clue. There are later provisions allowing you to use zero and infinity, but outside of these carve-outs (and imo the "allowed" language is telling here, since it implies any other number not equal to the number of words is not allowed) I don't think this is legal.
We always allow any number when we play, because part of the thinking is we cannot be sure what the spymaster has in mind. Of course, the number is related to the clue but possibly also to the game history up to that point. The teammates and opponents might interpret it wrong, and that’s OK. Infinity is typically used when there is enough info in principle to finish the game and a high risk if you don't; zero is super rare. We do tend to have very aggressive bids with tenuous connections, and 4 or 5 for a clue word are used in most games. Often, they don’t all work out in a single round, but on some lucky boards or in spousal teams, they occasionally work well.
You have a valid point, to which I'll concede. The rule book gives an example (spanning pages 4-5) where a guesser uses prior clues to select a card while the count is still within the number stated by the spymaster, but I suppose an allowance for guessers to deviate in this way does not also imply that spymasters may deviate in this way. Mea culpa!
Taking this a step further, given that it's well-known that a clue is deemed invalid when it pertains to cards in certain non-definitional ways (sounds-like, number of letters, etc.), it seems extremely reasonable to call a clue followed by N invalid if it doesn't pertain to N cards in a definitional way.
Yeah, in fact we tend to play without a limit on the number of guesses, just to avoid this sort of loophole. In variants like Codenames Duet I think there's also no limit on the number of guesses.
Another thing the guessers can do, if unsure about one of the tiles from the last round, is to tell the clue giver which tile they think it was. The clue giver then tries to give a clue that either tenuously links to it or clearly excludes it. That can give the clue more scope for linking to several other words. It risks giving information to the other team, though, so it is more of a final-turn play.
I find the game more about reading the people on your team (and the other team) to understand how they think.
You have to give entirely different clues depending on the people you play with.
Sometimes you can also play adversarial and introduce doubt into the opposing team by giving topic-adjacent clues that cause them to avoid one of their own cards. It works better if someone on the other team tends to be a big doubter. It also can work when the other team constantly goes back and tries to pick n+1 cards that they think they missed from the last round, which gives you a lot of room to psychologically mess with them.
Sometimes you have a clue that only really matches 2, but because only 1 of the wrong matches is a neutral card and you could match 2 more by a massive stretch, you say “4.” Worst case, they get 2 right and then pick the neutral card; best case, you stand to gain 4 for a clue that should only match 2.
I like Codenames because there are many meta ways to play the game. What makes Codenames unique is that, unlike a lot of other games (Catan, Secret Hitler, CAH, etc.), it’s an adversarial team game where the team dynamics and discussions are not secret so you can use them to your advantage.
Some of these clues wouldn't be very good for a human playing. "007" for example isn't a very good clue for "laser", not only because something happening to be in one of several films about a character doesn't rise to the typical level of salience, but also because other words on-board like "shark" and "astronaut" even moreso meet the criterion of featuring prominently in James Bond movies, and "astronaut" appears to be a game-ending choice.
This is the take I thought I'd have, but in the last example, the guesser model reaches the correct conclusion using a different reasoning than the clue giver model.
The clue giver justifies the link of Paper and Log as "written records", and between Paper and Line as "lines of text". But the guesser model connects Paper and Log because "paper is made from logs" (reaching the conclusion through a different meaning of Log), and connects Paper and Line because "'lined paper' is a common type of paper".
Similarly, in the first example, the clue giver connects Monster and Lion because lions are "often depicted as a mythical beast or monster in legends" (a tenuous connection if you ask me), whereas the guesser model thought about King because of King Kong (which I also prefer to Lion).
The best available evidence suggests this is also true of any explanations a human gives for their own behaviour; nevertheless we generally accept those at face value.
The explanations I give of my behaviour are post-hoc (unless I was paying attention), but I also assess their plausibility by going "if this were the case, how would I behave?" and seeing how well that prediction lines up with my actual behaviour. Over time, I get good at providing explanations that I have no reason to believe are false – which also tend to be explanations that allow other people to predict my behaviour (in ways I didn't anticipate).
GPT-based predictive text systems are incapable of introspection of any kind: they cannot execute the algorithm I execute when I'm giving explanations for my behaviour, nor can they execute any algorithm that might actually result in the explanations becoming or approaching truthfulness.
The GPT model is describing a fictional character named ChatGPT, and telling you why ChatGPT thinks a certain thing. ChatGPT-the-character is not the GPT model. The GPT model has no conception of itself, and cannot ever possibly develop a conception of itself (except through philosophical inquiry, which the system is incapable of for different reasons).
Of course! If you’ve played Codenames and introspected on how you play you can see this in action. You pick a few words that feel similar and then try to justify them. Post-hoc rationalization in action.
Yes and you may search for other words that fit the rationalization to decide whether or not it's a good one. You can go even further if your teammates are people you know fairly well by bringing in your own knowledge of these people and how they might interpret the clues. There's a lot of strategy in Codenames and knowledge of vocabulary and related words is only part of it.
If an LLM states an answer and then provides a justification for that answer, the justification is entirely irrelevant to the reasoning the bot used. It might be that the semantics of the justification happen to align with the implied logic of the internal vector space, but it is best case a manufactured coincidence. It’s not different from you stating an answer and then telling the bot to justify it.
If an LLM is told to do reasoning and then state the answer, it follows that the answer is basically guaranteed to be derived from the previously generated reasoning.
The answer will likely match what the reasoning steps bring it to, but that doesn’t mean the computations by the LLM to get that answer are necessarily approximated by the outputted reasoning steps. E.g. you might have an LLM that is trained on many examples of Shakespearean text. If you ask it who the author of a given text is, it might give some more detailed rationale for why it is Shakespeare, when the real answer is “I have a large prior for Shakespeare”.
Yes, the reason is that the model assigns words positions in an ever-changing vector space and evaluates relation by their correspondence in that space—the reply it gives is also a certain index of that space, with the “why” in the question giving it the weight of producing an “answer.”
Which is to say that “why” it gives those answers is because it's statistically likely within its training data that when there are the words, “why did you connect line and log with paper” the text which follows could be “logs are made of wood and lines are in paper.” But that is not the specific relation of the 3 words in the model itself, which is just a complex vector space.
I definitely think it's doing more than that here (at least inside of the vector-space computations). The model probably directly contains the paper-wood-log association.
Generally there is a "temperature" parameter that can be used to add some randomness or variety to the LLM's outputs by changing the likelihood of the next word being selected. This means you could just keep regenerating the same response and get a different plausible answer each time, all from the same model. This doesn't mean it believes any of them; it just keeps hallucinating likely text, some of which will fit better than others. It is still very much the same brain (or set of trained parameters) playing with itself.
Yeah not sure what’s impressive about this. Having the model be both the guesser and clue giver will of course have good results as it’s simply a reflection of o1’s weighting of tokens.
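To make the temperature point concrete, here is a minimal sketch of temperature-scaled sampling over a handful of candidate explanations. The candidate strings and their scores are invented for illustration; this is the textbook softmax-with-temperature idea, not a description of how o1 or any particular model is implemented.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(value - peak) for value in scaled]
    total = sum(weights)
    return [weight / total for weight in weights]


def sample_explanation(candidates, logits, temperature, rng):
    """Draw one candidate according to the temperature-scaled probabilities."""
    probabilities = softmax_with_temperature(logits, temperature)
    return rng.choices(candidates, weights=probabilities, k=1)[0]


# Invented candidates and scores, for illustration only.
CANDIDATES = [
    "paper is made from logs",
    "a log is a written record",
    "a log is a paper trail",
]
LOGITS = [2.0, 1.5, 0.5]

if __name__ == "__main__":
    rng = random.Random(0)
    for temperature in (0.2, 1.0, 2.0):
        picks = [sample_explanation(CANDIDATES, LOGITS, temperature, rng) for _ in range(5)]
        print(temperature, picks)
```

At low temperature the top-scored explanation dominates; at higher temperature the alternatives show up more often, even though the underlying scores never change.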
Interestingly this could be a way to potentially reverse engineer o1’s weightings
Could this just be a case of Reddit being included in the training data?
“I read through codenames official rules to see if using "007" as a clue was allowed, and it turns out it is! To my surprise, I even came across a Reddit post where people were discussing and justifying why this clue fits perfectly within the rules.”
That is a really interesting point. If it is true, this shows direct usage of a single training data point (because there are no other resources talking about this fact).
Codenames is absolutely dead-center of what I expect Large Language Models to be good at. The fundamental skills of the game are: having an excellent embedding for word semantics and connotations; modeling other people's embeddings; a little bit of game strategy related to its competitive nature.
I am similarly less-than-impressed. If you click through to the website, you can watch the replay of one of the games mentioned in the article (the one with the clue "invader").
In that instance, the clues all matched 2-3 words, and the winning team got lucky twice (they guessed an unclued word using an unintended correlation, and their opponent guessed a different one of their unclued words.)
You also see a number of instances where the agents continue guessing words for a clue even though they've already gotten enough matches. For instance, in round 2, for the clue "Japan (2)", the blue team guesses sumo and cherry, then goes for a rather tenuous followup guess for round 1's 007 with "ring" (despite having gotten the two clued matches in the first round). A sillier example is in the final round, where the Red Team guesses 3 clues (thereby identifying all nine of their target words), then goes ahead and guesses another word.
(For what it's worth, I think "shark" would have been a better guess for another 007 tie-in seeing as there are multiple Bond movies with sharks, but it's also not a match, and again, I wouldn't have gone for a third guess here when there were only two clued words.)
I was wondering about the same. It is possible that the instructions didn’t try to make the gameplay as aggressive as possible. A good model could optimize the separator to make it easy to guess the most words possible. By having access to its own state, it should be possible to reach 5–6 words in most cases. There is an argument for keeping words around that would increase the difficulty of the opponents guessing large/clean separations, so it is possible that optimal play includes simple pairs on occasion. Very interesting application nonetheless.
I did that last summer. I compared the performance of different English word embedding models; as far as I remember, the best ones were GloVe and a few knowledge graph word embeddings.
None of them were better than a human at giving hints for 3+ words, though.
I did this with Claude over the holidays, putting Claude in the role of guesser and comparing its guesses to those of an experienced human player. It turns out they matched each other.
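For anyone curious what the embedding approach discussed in this thread looks like, here is a toy sketch: score a candidate clue by how close it sits to its weakest target, minus how close it sits to the nearest word you must avoid. The 3-dimensional vectors are invented for illustration and do not come from GloVe, word2vec, or any real model; the scoring rule is one plausible choice, not what the commenter actually used.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def clue_score(clue_vec, target_vecs, avoid_vecs):
    """Weakest target similarity minus strongest similarity to a word to avoid."""
    worst_target = min(cosine(clue_vec, t) for t in target_vecs)
    closest_avoid = max((cosine(clue_vec, a) for a in avoid_vecs), default=-1.0)
    return worst_target - closest_avoid


# Toy vectors, made up for this example (not real embeddings).
TOY = {
    "paper": [1.0, 0.2, 0.0],
    "log": [0.8, 0.5, 0.0],
    "mail": [0.9, 0.1, 0.1],
    "gum": [0.7, 0.0, 0.6],
    "jack": [0.0, 1.0, 0.2],
}

if __name__ == "__main__":
    targets = [TOY["log"], TOY["mail"]]
    # The same clue looks much worse when a nearby word must be avoided.
    print("avoid gum: ", clue_score(TOY["paper"], targets, [TOY["gum"]]))
    print("avoid jack:", clue_score(TOY["paper"], targets, [TOY["jack"]]))
```

This captures the point made earlier in the thread: linking several words is the easy half, and the margin against semantically adjacent words is what makes a clue safe.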
It would be fun to build one, perhaps mediated by an app, where you have to guess whether your spymaster is a human or an AI based on the quality of their choices.
It's the (b) case I'm interested in. Like the spymaster loses if they can't subtly indicate to their friends that they're the real deal. Otherwise the robots win.
I thought of adding a feature where you can get your own spymaster. You can give it all your personal info and the clues would be customized. The bottleneck is that the other human spymaster has to help with updating the game state, because I (the guesser) can't look at the spymaster view.
I have been doing some experiments with agents and reinforcement learning playing a 4x4 Tic Tac Toe game [1]. Given my analysis of the "thought" process, we are still really far from true understanding of such games. While in my game as well as OP's, the rules are pre-trained and the models are good enough to reach a conclusion (which in itself is already impressive), it is still a long way.
I've intuitively felt that this general class of task is what these LLMs are absolutely best at. I'm not an expert on these things, but isn't this thanks to word embeddings and how words are mapped into high dimensional vector space within the model? I would imagine that because every word is mapped this way, finding a word that exists in the same area as mail, lawyer, log, and line in some vector space would be trivial for the model to do, right?
More than just words. I've found LLMs immensely helpful for searching through the latent space or essence of quotes/books/movies/memes. I can ask things like "what's that book/movie set in X where Y happens" or "what's that quote by a P which goes something like Q" in my own paraphrased way and, with a little prodding, expect the answer. You'd have no luck with traditional search engines unless someone has previously asked a similar question.
I've been trying out various "reasoning" models (o1, R1, Gemini Thinking etc) against the NYT Connections word puzzle - it's a really interesting test of them. So far o1 Pro has been the most consistently successful: https://www.nytimes.com/games/connections
I don't find this remotely compelling. I can easily come up with clues that make sense to me to connect a ton of words; the difficulty is coming up with clues that others will look at the same way. The last example is exactly what I mean: "paper" makes sense for those 4 only when you explain it. If "Line" counts then why not "Gum" (which is typically wrapped in paper), or if "Lawyer" is valid then why not "King" (whose decrees are written on what?).
I made one where you play with the AI a few years back instead of AI v AI but never posted it anywhere if anyone wants to try, just updated it to gpt-4o-mini https://wordswithrobots.isotropic.us/
True, because then it feels more intentional (+ the extra strategy). It was definitely a bit thrown together- atm I only ever use it when I need a bit of practice before playing codenames.
I mean, it's playing against itself, which is not really a fair comparison to humans in my mind. The fun and hard part of this game is to get into your teammates' brains and decipher what they possibly meant with what they played.
I still enjoyed reading your post, it's fun and interesting!
Maybe one could try having two different models play together, to see if they are genuinely good at the game or simply able to infer their own reasoning, if that makes sense.
I'm kinda bad at word games like codenames, even in my native language (French). With carrot and ray, I'd try something like "striation"? But it's really convoluted.
I tried whatever the multi-modal paid ChatGPT model is on the Codenames Pictures version, and it didn't fare that well. Since they will probably scrape this comment and add it to next model's training data, I look forward to it getting good!
It would be really interesting to see an LLM watch other players and learn how they think to find the best clues THEY need to hear to find the right words.
Fun quirk about this game: If there aren't too many cards left and your teammate knows their powers of two, you have a winning strategy.
You simply lay a mental bitmap over all remaining cards, setting 1 for cards that belong to your team and 0 for all others.
You can then just say the number that is represented by this bitmap, e.g. "five" for 0101, and your teammate can decode it in their head. All numbers are, after all, single words.
This means, if you are very good at mental maths or you allow for a calculator, you could also win every game in the first round.
For me personally however, it only becomes feasible with around 10 cards remaining.
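A minimal sketch of that bitmap trick, assuming both players agree to read the remaining cards in a fixed order with the first card as the most significant bit. The function names are my own.

```python
def encode(is_ours):
    """Pack one flag per remaining card into an integer, first card most significant."""
    value = 0
    for flag in is_ours:
        value = (value << 1) | int(flag)
    return value


def decode(value, card_count):
    """Recover the per-card flags from the spoken number."""
    return [bool((value >> shift) & 1) for shift in range(card_count - 1, -1, -1)]


if __name__ == "__main__":
    # The example from the comment: bitmap 0101 is spoken as "five".
    flags = [False, True, False, True]
    number = encode(flags)
    print(number)
    print(decode(number, len(flags)))
    # With n cards left, the spoken number can be as large as 2**n - 1.
    print(2 ** 10 - 1)
```

The range of possible numbers doubles with every extra card, which is why doing this in your head only gets practical once the board has thinned out.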
It is explicitly against the rules (https://czechgames.com/files/rules/codenames-rules-en.pdf), so they were correct. "Your clue must be about the meaning of the words. You can't use your clue to talk about the letters in a word or its position on the table."