OpenAI's o1 Playing Codenames

lolinder · 2025-01-25T15:41:20 1737819680

A small weakness in this test is that one of the keys to strategic Codenames play is understanding your partner. You're not just trying to connect the words, you're trying to connect them in a way that will be obvious to your partner. As a computing analogy: you're trying to serialize a few cards in a way that will be deserializable by the other player.

This test pairs o1 with itself, which means the serializer is the deserializer. So while it's impressive that it can link 4 words, most humans could also easily link 4 with as much stretching! We just don't tend to because we can't guarantee that the other human will make the same connections we did.

ModernMech · 2025-01-25T15:46:35 1737819995

lol I played this game with my family and they said my wife and I were cheating because I kept using inside jokes that made no sense to them but she would get immediately.

dgritsko · 2025-01-25T15:54:36 1737820476

That's a big part of what makes this game enjoyable - a clue that is very obvious to one person might not even cross the mind of someone else. To anyone reading this who hasn't played, it's definitely worth giving it a try.

slyn · 2025-01-25T18:18:47 1737829127

Agreed, big fan of codenames in general but it plays its best when you’re playing against / alongside people that you’ve known for a while. The metagaming aspect of structuring clues to who your partner is really takes it to the next level.

lupire · 2025-01-26T02:54:54 1737860094

Same for Taboo for me. It's why we married.

cyode · 2025-01-26T10:44:59 1737888299

Stretching? Never! I see your 4-clue, o1, and raise you “QUEUE” for 5:

  - Line (Standing in the queue…)
  - London (they’re all queued up, innit?)
  - Log (*backend distsys handwaving*)
  - Mail (what do you think an inbox is, anyway?!)
  - Round (homophone “Q” is a typographically round letter)

paulddraper · 2025-01-26T16:50:15 1737910215

I think Round may be invalid but in any case I would not have gotten it.

suveen_ellawela · 2025-01-25T23:19:24 1737847164

thanks for the comment. I actually tried explicitly mentioning in the prompt that 'Your guesser follows the same reasoning process'. But this did not make any clear improvements. Maybe I should've done more prompt engineering.

lolinder · 2025-01-26T00:50:14 1737852614

Nah, prompt engineering wouldn't have solved the fundamental issue, which is that the associations between ideas as stored in the weights will be the same between the two AI players, which makes it an easier game for them than for a human equivalent. It'd be like two copies of you playing on a team, having shared all the same experiences right up until the moment the game starts.

And don't get me wrong, it's still a fun experiment! It's just that that 4 would never have worked if a human played against another human—there are simply too many other words that would be equally strongly associated:

* Gum: Gum is often wrapped in paper, so 'GUM' is strongly associated with the word 'PAPER'.

* King: King is a type of face card, which are printed on paper, so 'KING' is strongly associated with the word 'PAPER'. (Repeat for JACK.)

* Light: Paper is a lightweight material.

That's 4 others right there that are at least as closely connected in my head as LAWYER or LOG. The only reason why o1 pulled up the same four when guessing as it did when clueing is that it's the same model.

Again, I didn't mean this as a knock, just a warning about drawing too many conclusions from the test!

lupire · 2025-01-26T02:54:08 1737860048

When I saw those 4 words I thought of "letter" or "writing". (But I likely wouldn't have thought of that cluster while scanning the full board.)

I think "paper" is a great clue, and those 4 words lawyer/mail/log/line match better than gum/king/light.

There's an even better reason for "lawyer/paper" than ChatGPT gave: lawyers "serve papers".

lolinder · 2025-01-26T03:34:17 1737862457

That we disagree on this is exactly why who you're playing with matters. I'd have never gotten to lawyer, certainly wouldn't have connected log. Line is a very faint possibility. Mail is the only one I'd have gotten for sure.

jncfhnb · 2025-01-25T16:16:30 1737821790

Ehhh I don’t think that’s accurate. The problem is not linking 4 words. It’s linking 4 words without accidentally triggering other, semantically adjacent words.

This task could probably be solved nearly just as well with old school word 2 vec embeddings

lolinder · 2025-01-25T16:32:40 1737822760

Right, that's what I meant to be getting at: when you connect 4 words with as much stretching as o1 did there, you're running a real risk that the other party connects a different set. Unless that other party is also you and has the same learned connections at top of mind.

furyofantares · 2025-01-25T21:15:16 1737839716

> This task could probably be solved nearly just as well with old school word 2 vec embeddings

I've tried. This approach is well beyond awful.

jncfhnb · 2025-01-27T02:52:24 1737946344

I see a few papers published that did exactly this successfully. It also just sounds crazy that it wouldn’t work well.

It’s odd to me that you would confidently claim it’s “beyond awful”.

furyofantares · 2025-01-27T22:09:54 1738015794

I'm confidently relaying my experience. But I get that I was extremely terse and overly general in my reply.

I haven't surveyed all the papers, although I have read some. And all the ones that I've seen that work okay -- do so by using a language graph or word association graph in their algorithm. Not just embeddings. Even then the results don't look good to me compared to human performance.

Why does it sound crazy that it wouldn't work well? Have you used word embeddings much? Maybe you have and have good reason to think this - I don't mean to imply otherwise. But it doesn't sound crazy to me that it wouldn't work well.

If I am wrong I would love to know it.

zeroonetwothree · 2025-01-25T15:11:27 1737817887

I don’t find this “super good”. It’s mostly giving 2 clues which is the most basic level of competence. The paper 4 clue is reasonable but a bit lucky (eg Jack is also a good guess). I also don’t see it actually using tactics properly, which I would consider part of being “super good”. The game isn’t just about picking a good clue each round!

Now obviously it’s still pretty decent at finding the clues. Probably better than a random human who hasn’t played much. Just I find the post’s level of hype overstated. It feels like the author isn’t very experienced with Codenames.

It would be interesting to compare AI:human vs human:human games to see which does better. It seems like AI:AI will overstate its success.

dang · 2025-01-25T21:46:10 1737841570

Ok, we've taken supergoodness out of the title now. Presumably the post is still interesting!

(Submitted title was "I got OpenAI o1 to play the boardgame Codenames and it's super good".)

groggo · 2025-01-25T15:19:44 1737818384

Can you elaborate on some of the more advanced tactics?

When I play, it's mostly about getting a good 2 clue each time. Then if you can opportunistically get a 3 or 4, that's awesome.

Some tactics come in for choosing the right pairs of 2's so you don't end up mismatched, or leaving clues that might be ambiguous with your opponent's... But that's mostly it.

It'll be fun for multiplayer! Just like how in other online games you can add in an AI to play as one of the players.

jncfhnb · 2025-01-25T16:20:32 1737822032

If you really want to get good, your goal is not so much to get as many tiles as possible, but rather to get the tiles that are semantically distinct from your opponent’s. A single mistake that triggers your opponent’s tile is generally enough to lose the game. And even if they don’t do it, having them uncover the tiles from their side that are semantically similar to your own team is also useful.

If you want to get nasty, you learn to abuse the fact that the tile layouts follow rules and that you can rule out certain tiles without considering the words.

groggo · 2025-01-25T19:26:39 1737833199

Memorizing the tile layouts is too much for me haha (imo against the spirit of the game). I usually play online now anyway so I hope they don't follow those same patterns as the physical version.

joshvm · 2025-01-26T16:37:56 1737909476

Online specifically avoids this by randomising the grids, except in some modes like mirrors where you can't do much to preserve symmetry.

bentcorner · 2025-01-25T19:26:56 1737833216

> you learn to abuse the fact that the tile layouts follow rules and that you can rule out certain tiles without considering the words.

Can you clarify? Isn't the card placement random?

tveita · 2025-01-25T22:01:08 1737842468

There are 40 setup cards with 4 possible rotations that specify agent placements, so it's theoretically possible to do some kind of memorization.

Personally I'd find that kind of play style very unfun, and would rather switch to fully randomized boards if I played enough that it became a problem.

https://danluu.com/codenames/
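To make the layout count concrete, here is a minimal sketch of how one key card yields four board orientations. The `KEY_CARD` grid below is invented for illustration (it only respects the 9/8/7/1 split of agents, bystanders and assassin) and is not one of the real setup cards; the function names are hypothetical too.

```python
def rotate_clockwise(grid):
    """Rotate a square grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

def all_rotations(grid):
    """Return the four orientations of a key card, starting with the original."""
    rotations = []
    current = [list(row) for row in grid]
    for _ in range(4):
        rotations.append(current)
        current = rotate_clockwise(current)
    return rotations

# Invented example card: R = red, B = blue, N = neutral, A = assassin.
KEY_CARD = [
    "RBNRB",
    "NRRBN",
    "BNARB",
    "RBNRN",
    "BRNBR",
]

SETUP_CARDS = 40   # figure quoted in the comment above
ORIENTATIONS = 4
TOTAL_LAYOUTS = SETUP_CARDS * ORIENTATIONS  # at most 160 distinct layouts

for orientation in all_rotations(KEY_CARD):
    print("\n".join("".join(row) for row in orientation), end="\n\n")
```

With only about 160 candidate layouts, a couple of revealed tiles can in principle rule out most of them, which is what makes memorization possible at all.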

jncfhnb · 2025-01-25T20:00:30 1737835230

It’s randomish. There are facts about the possible layouts you can memorize.

I don’t know most of these rules but there’s never 5 in a row; or even 4 in a row if you’re the team with one fewer (second team to play).

Edit: because the game layout is determined by choosing one of a few dozen possible layout cards and randomly rotating it

mtmickush · 2025-01-25T15:40:38 1737819638

Other advanced tactics involve giving a broad clue that matches 3-4 of your own and just one other (either your opponents or a civilian). Your team can pick up all the matches across several turns and the one off doesn't hurt as much as the plus four helps

hunter2_ · 2025-01-25T16:26:20 1737822380

The S-tier tactic: When that high-number clue is cut short by a turn-ending mistake, the guessers tell their clue giver to inflate the number given during the totally unrelated next clue by however many remained from the truncated turn for which they don't need additional information to locate (and therefore it would be wasteful for a future clue to re-group those) so the stated number of that next clue must allow for its own cards plus the prior cards.

Example: The clue is "places 4" and the guessers choose 1 correctly and then 1 wrong answer, but they had achieved consensus about 2 others (and are confused about only the remaining 1). So the turn ends but they inform the clue giver to inflate by 2 next turn. That clue giver (after the other team goes) will then say the clue is "people 5" and the guessers will know that they shall select 2 places and 3 people.

This can cascade beyond just a pair of turns.
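The bookkeeping in that example can be sketched in a few lines. This is only an illustration of the arithmetic; the function and parameter names are made up here and nothing like this appears in the rule book.

```python
def carried_cards(clue_number, correct_guesses, agreed_but_untouched):
    """Cards from a truncated clue that the guessers can still find unaided."""
    remaining = clue_number - correct_guesses
    # Guessers cannot carry more cards than the clue still had outstanding.
    return min(agreed_but_untouched, remaining)

def stated_number(new_clue_cards, carried):
    """Number the clue giver announces: the new clue's cards plus carried ones."""
    return new_clue_cards + carried

# "places 4": 1 correct guess, then a miss; consensus on 2 of the remaining 3.
carried = carried_cards(clue_number=4, correct_guesses=1, agreed_but_untouched=2)
print(carried)                    # 2
# Next clue really covers 3 people, so the clue giver says "people 5".
print(stated_number(3, carried))  # 5
```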

ruds · 2025-01-25T16:40:01 1737823201

I don't think this sort of communication from guessers to clue giver is in the spirit of the game (at least in my play group). However, inflating later clues is a reasonable approach! It's just that I don't think you're allowed to communicate the amount of inflation. Guessers must determine whether people 5 has slack to allow additional guesses on previous clues.

hunter2_ · 2025-01-25T16:43:32 1737823412

You're free to add additional prohibitions on communication as a house rule I guess, but the only prohibition in the rule book I've seen is that the clue giver's speech must consist exclusively of clues (and private consultation with the other clue giver). The clue giver is free to adjust their clue in reaction to anything they hear, and guessers can speak freely.

Important: the clue giver cannot acknowledge the instruction during gameplay. That would certainly extend beyond giving a clue! The guessers must know that their clue giver can play this way prior to the game commencing.

Edit: I just consulted the rules and this is the most relevant section:

> If you are a field operative, you should focus on the table when you are making your guesses. Do not make eye contact with the spymaster while you are guessing. This will help you avoid nonverbal cues.

> When your information is strictly limited to what can be conveyed with one word and one number, you are playing in the spirit of the game.

The author's use of the pronoun "you/your" switches from field ops in that first paragraph to spymasters in that second paragraph, confusingly. With that in mind, it boils down to this: field ops cannot seek non-clue information from spymasters, and spymasters cannot convey non-clue information. The strategy I'm suggesting involves neither!

ALittleLight · 2025-01-25T17:41:27 1737826887

If you take this idea of communication restrictions to the limit, you could imagine the guessers identifying N sets of cards by a single word each as they discuss their guess. The clue giver listens, then uses the clue that identifies the correct set of N cards.

You really just need an algorithm that generates unique sets of 8 or 9 cards from the whole board and identifies each set by a word.
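As a rough sense of scale for that idea, the sketch below counts how many candidate sets such a scheme would have to distinguish on a fresh 25-card board. It assumes nothing has been revealed yet; the helper names are hypothetical.

```python
import math
from itertools import combinations, islice

BOARD_SIZE = 25  # standard 5x5 board

def count_candidate_sets(team_size, board_size=BOARD_SIZE):
    """Number of distinct card sets of the given size on the board."""
    return math.comb(board_size, team_size)

def first_sets(words, team_size, limit):
    """Lazily list the first `limit` candidate sets in lexicographic order."""
    return list(islice(combinations(words, team_size), limit))

print(count_candidate_sets(9))  # 2042975
print(count_candidate_sets(8))  # 1081575
```

So a single word would need to pick one set out of roughly one to two million, which is why in practice the guessers would have to narrow the candidates down in discussion first.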

groggo · 2025-01-25T19:42:44 1737834164

Yeah it's interesting to take these ideas to the extreme... even at the lower end I don't like it, I think zero communication outside of clues is the best way to follow the spirit of the game. But a little bit of banter and "kibitzing" is what makes it fun too.

spencerflem · 2025-01-26T01:08:14 1737853694

I played in a Codenames tournament at CGE's stand at GenCon, and they forbade guessers from communicating at all. Officially, it's supposed to be just the clue and number and nothing else.

Of course, I never play this way in my own games

hunter2_ · 2025-01-26T06:16:53 1737872213

How do guessers arrive at a consensus about what card to touch, if they are forbidden from communicating at all?

spencerflem · 2025-01-27T03:27:51 1737948471

Officially it's a 4-player-only game, at least at the tournament. I never do it this way myself though.

ta_1138 · 2025-01-25T20:21:42 1737836502

The communication is only necessary/important if people haven't set this as a convention in the first place. I'll say that prior to ever looking at my clues: "I will give you higher numbers than what I said if you miss by more than 1. The number I pick will always be high enough to allow you to, with the +1 guess you get for free, make guesses on all the words I was hinting at."

There's also all kinds of not necessarily intended communication from the guessers in the fact that you can listen to which words they were considering and didn't pick. Nothing in the game attempts to say that you should not consider, say, whether they were going in the right or wrong direction in their guessing, but it sure can make a difference in how to approach later clues. If they were being very wrong, there might be a need to double up on words that you intended, and that your guessers missed.

In the same fashion, nothing in the game says that I cannot listen to those guesses as a member of the other team, whether guesser or spymaster, and then change behaviors to make sure we don't hit words they considered as candidate words without very good reasons. Let them double dip on mistakes, or not make their difficult decisions easier. It's not as if the game demands that everyone who isn't currently guessing should wear headphones to be sure they disregard what the other team says or does.

foota · 2025-01-25T20:45:07 1737837907

You can of course play however you want (and I certainly think this is clever), but imo this is likely against the spirit, and perhaps letter, of the rules.

The rule on giving clues is:

"If you are the spymaster, you are trying to think of a one-word clue that relates to some of the words your team is trying to guess. When you think you have a good clue, you say it. You also say one number, which tells your teammates how many codenames are related to your clue." (emphasis mine).

The rule states that the number should be the number of words related to the clue. There are later provisions allowing you to use zero and infinity, but outside of these carve-outs (and imo the "allowed" language is telling here, since it implies any other number not equal to the number of words is not allowed) I don't think this is legal.

pama · 2025-01-26T04:01:02 1737864062

We always allow any number when we play, because part of the thinking is we cannot be sure what the spymaster has in mind. Of course, the number is related to the clue but possibly also to the game history up to that point. The teammates and opponents might interpret it wrong, and that’s OK. Infinity is typically used when there is enough info in principle to finish the game and a high risk if you don’t; zero is super rare. We do tend to have very aggressive bids with tenuous connections, and 4 or 5 for a clue word are used in most games. Often, they don’t all work out in a single round, but on some lucky boards or in spousal teams, they occasionally work well.

hunter2_ · 2025-01-26T06:34:29 1737873269

You have a valid point, to which I'll concede. The rule book gives an example (spanning pages 4-5) where a guesser uses prior clues to select a card while the count is still within the number stated by the spymaster, but I suppose an allowance for guessers to deviate in this way does not also imply that spymasters may deviate in this way. Mea culpa!

Taking this a step further, given that it's well-known that a clue is deemed invalid when it pertains to cards in certain non-definitional ways (sounds-like, number of letters, etc.), it seems extremely reasonable to call a clue followed by N invalid if it doesn't pertain to N cards in a definitional way.

hunter2_ · 2025-01-26T06:48:07 1737874087

Indeed, a good Codenames-playing bot should know how to do all of this, in addition to using its LLM to generate great clues.

n4r9 · 2025-01-25T23:08:11 1737846491

Yeah, in fact we tend to play without a limit on the number of guesses, just to avoid this sort of loophole. In variants like Codenames Duet I think there's also no limit on the number of guesses.

Another thing the guessers can do, if unsure about one of the tiles from the last round, is to tell the clue giver which tile they think it was. The clue giver then tries to give a clue that either tenuously links to it or clearly excludes it. That can give the clue more scope for linking to several other words. It risks giving information to the other team though, so it is more of a final-turn play.

lostlogin · 2025-01-25T17:58:51 1737827931

> the one off doesn't hurt as much as the plus four helps

Doesn’t the turn end if you hit the opponents word?

topaz0 · 2025-01-25T18:39:09 1737830349

Yes, but they can go back for those words in future rounds

harrall · 2025-01-25T17:37:33 1737826653

I find the game more about reading the people on your team (and the other team) to understand how they think.

You have to give entirely different clues depending on the people you play with.

Sometimes you can also play adversarial and introduce doubt into the opposing team by giving topic-adjacent clues that cause them to avoid one of their own cards. It works better if someone on the other team tends to be a big doubter. It also can work when the other team constantly goes back and tries to pick n+1 cards that they think they missed from the last round, which gives you a lot of room to psychologically mess with them.

Sometimes you have a clue that only really matches 2, but because only 1 of the wrong matches is a neutral card and you could match 2 more by a massive stretch, you say “4.” Worst case, they get 2 right and then pick the neutral card, but in the best case you stand to gain 4 for a clue that should only match 2.

I like Codenames because there are many meta ways to play the game. What makes Codenames unique is that, unlike a lot of other games (Catan, Secret Hitler, CAH, etc.), it’s an adversarial team game where the team dynamics and discussions are not secret so you can use them to your advantage.

blix · 2025-01-25T20:31:43 1737837103

experienced players who know their teammates well can reliably get 3-4s. if you only go for safe 2s against these opponents you will lose every time.

lupire · 2025-01-26T02:56:27 1737860187

They should at least play with two different AI models.

lsy · 2025-01-25T17:45:34 1737827134

Some of these clues wouldn't be very good for a human playing. "007" for example isn't a very good clue for "laser", not only because something happening to be in one of several films about a character doesn't rise to the typical level of salience, but also because other words on-board like "shark" and "astronaut" even moreso meet the criterion of featuring prominently in James Bond movies, and "astronaut" appears to be a game-ending choice.

croes · 2025-01-25T12:48:31 1737809311

Is that really surprising?

It’s basically the same brain playing with itself. Seems quite natural to link the code names to the same words.

Let different LLMs play.

deredede · 2025-01-25T13:05:21 1737810321

This is the take I thought I'd have, but in the last example, the guesser model reaches the correct conclusion using a different reasoning than the clue giver model.

The clue giver justifies the link of Paper and Log as "written records", and between Paper and Line as "lines of text". But the guesser model connects Paper and Log because "paper is made from logs" (reaching the conclusion through a different meaning of Log), and connects Paper and Line because "'lined paper' is a common type of paper".

Similarly, in the first example, the clue giver connects Monster and Lion because lions are "often depicted as a mythical beast or monster in legends" (a tenuous connection if you ask me), whereas the guesser model thought about King because of King Kong (which I also prefer to Lion).

wizzwizz4 · 2025-01-25T13:38:34 1737812314

> But the guesser model connects Paper and Log because "paper is made from logs" (reaching the conclusion through a different meaning of Log)

No, it doesn't. It reaches the conclusion because of vector similarity (simplified explanation): these explanations are post-hoc.

DominikPeters · 2025-01-25T15:46:00 1737819960

This is o1 so it need not be post hoc but the result of reasoning about several possible choices and explanations.

lmm · 2025-01-25T14:38:29 1737815909

> these explanations are post-hoc.

The best available evidence suggests this is also true of any explanations a human gives for their own behaviour; nevertheless we generally accept those at face value.

wizzwizz4 · 2025-01-25T19:03:07 1737831787

The explanations I give of my behaviour are post-hoc (unless I was paying attention), but I also assess their plausibility by going "if this were the case, how would I behave?" and seeing how well that prediction lines up with my actual behaviour. Over time, I get good at providing explanations that I have no reason to believe are false – which also tend to be explanations that allow other people to predict my behaviour (in ways I didn't anticipate).

GPT-based predictive text systems are incapable of introspection of any kind: they cannot execute the algorithm I execute when I'm giving explanations for my behaviour, nor can they execute any algorithm that might actually result in the explanations becoming or approaching truthfulness.

The GPT model is describing a fictional character named ChatGPT, and telling you why ChatGPT thinks a certain thing. ChatGPT-the-character is not the GPT model. The GPT model has no conception of itself, and cannot ever possibly develop a conception of itself (except through philosophical inquiry, which the system is incapable of for different reasons).

chongli · 2025-01-25T14:45:27 1737816327

Of course! If you’ve played Codenames and introspected on how you play you can see this in action. You pick a few words that feel similar and then try to justify them. Post-hoc rationalization in action.

topaz0 · 2025-01-25T18:29:33 1737829773

Except you also examine the rationalization as part of deciding whether to act on the impulse or not.

chongli · 2025-01-25T22:42:17 1737844937

Yes and you may search for other words that fit the rationalization to decide whether or not it's a good one. You can go even further if your teammates are people you know fairly well by bringing in your own knowledge of these people and how they might interpret the clues. There's a lot of strategy in Codenames and knowledge of vocabulary and related words is only part of it.

Angostura · 2025-01-25T13:54:59 1737813299

Sorry, I’m uninformed. Do you mean that the explanation could be completely unrelated to the actual “reason”?

jncfhnb · 2025-01-25T16:23:53 1737822233

If an LLM states an answer and then provides a justification for that answer, the justification is entirely irrelevant to the reasoning the bot used. It might be that the semantics of the justification happen to align with the implied logic of the internal vector space, but it is best case a manufactured coincidence. It’s not different from you stating an answer and then telling the bot to justify it.

If an LLM is told to do reasoning and then state the answer, it follows that the answer is basically guaranteed to be derived from the previously generated reasoning.

ActivePattern · 2025-01-25T17:26:22 1737825982

The answer will likely match what the reasoning steps bring it to, but that doesn’t mean the computations by the LLM to get that answer are necessarily approximated by the outputted reasoning steps. E.g. you might have an LLM that is trained on many examples of Shakespearean text. If you ask it who the author of a given text is, it might give some more detailed rationale for why it is Shakespeare, when the real answer is “I have a large prior for Shakespeare”.

DiscourseFan · 2025-01-25T14:27:42 1737815262

Yes, the reason is that the model assigns words positions in an ever-changing vector space and evaluates relation by their correspondence in that space—the reply it gives is also a certain index of that space, with the “why” in the question giving it the weight of producing an “answer.”

Video series on the topic: https://www.3blue1brown.com/topics/neural-networks

Which is to say that “why” it gives those answers is because it's statistically likely within its training data that when there are the words, “why did you connect line and log with paper” the text which follows could be “logs are made of wood and lines are in paper.” But that is not the specific relation of the 3 words in the model itself, which is just a complex vector space.

jprete · 2025-01-25T14:45:43 1737816343

I definitely think it's doing more than that here (at least inside of the vector-space computations). The model probably directly contains the paper-wood-log association.

unlikelymordant · 2025-01-25T13:18:59 1737811139

Generally there is a "temperature" parameter that can be used to add some randomness or variety to the LLM's outputs by changing the likelihood of the next word being selected. This means you could just keep regenerating the same response and get different answers each time. Each time it will give different plausible responses, and this is all from the same model. This doesn't mean it believes any of them, it just keeps hallucinating likely text, some of which will fit better than others. It is still very much the same brain (or set of trained parameters) playing with itself.
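A minimal sketch of what temperature does to next-token sampling, using a generic softmax over made-up scores. The tokens and logits are invented for illustration, and this is not a claim about how o1 or any particular API implements it.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; temperature must be > 0."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(tokens, logits, temperature, rng):
    """Draw one token according to the temperature-adjusted distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]

# Invented scores for three candidate clues.
tokens = ["paper", "letter", "writing"]
logits = [2.0, 1.0, 0.5]

rng = random.Random(0)
for temperature in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    draws = [sample_token(tokens, logits, temperature, rng) for _ in range(5)]
    print(temperature, [round(p, 2) for p in probs], draws)
```

Lower temperatures concentrate probability on the top-scoring token, while higher ones flatten the distribution, so regenerating gives more varied answers from the same weights.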

suveen_ellawela · 2025-01-25T23:23:10 1737847390

I wanted to play around with the temperature, but unfortunately o1 only supports '1' as the value.

elicksaur · 2025-01-25T15:18:36 1737818316

Or, have it play a human and compare human-human and llm-human pairs.

ushiroda80 · 2025-01-25T13:05:25 1737810325

Yeah not sure what’s impressive about this. Having the model be both the guesser and clue giver will of course have good results as it’s simply a reflection of o1’s weighting of tokens.

Interestingly this could be a way to potentially reverse engineer o1’s weightings

kennyloginz · 2025-01-25T11:37:04 1737805024

Could this just be a case of Reddit being included in the training data?

“I read through codenames official rules to see if using "007" as a clue was allowed, and it turns out it is! To my surprise, I even came across a Reddit post where people were discussing and justifying why this clue fits perfectly within the rules.”

JohnMakin · 2025-01-25T13:26:17 1737811577

Yea, initially I thought this post was satire because of this.

suveen_ellawela · 2025-01-25T23:25:26 1737847526

That is a really interesting point. If it is true, this shows direct usage of a single training data point (because there are no other resources talking about this fact).

jprete · 2025-01-25T14:41:28 1737816088

Codenames is absolutely dead-center of what I expect Large Language Models to be good at. The fundamental skills of the game are: having an excellent embedding for word semantics and connotations; modeling other people's embeddings; a little bit of game strategy related to its competitive nature.

xnickb · 2025-01-25T12:18:28 1737807508

Somehow I expected AI to give clues that combine 4-5-6 words at a time. It's not at all impressive to me. And I'm not a serious player at all

vitus · 2025-01-25T13:40:04 1737812404

I am similarly less-than-impressed. If you click through to the website, you can watch the replay of one of the games mentioned in the article (the one with the clue "invader").

In that instance, the clues all matched 2-3 words, and the winning team got lucky twice (they guessed an unclued word using an unintended correlation, and their opponent guessed a different one of their unclued words.)

You also see a number of instances where the agents continue guessing words for a clue even though they've already gotten enough matches. For instance, in round 2, for the clue "Japan (2)", the blue team guesses sumo and cherry, then goes for a rather tenuous followup guess for round 1's 007 with "ring" (despite having gotten the two clued matches in the first round). A sillier example is in the final round, where the Red Team guesses 3 clues (thereby identifying all nine of their target words), then goes ahead and guesses another word.

(For what it's worth, I think "shark" would have been a better guess for another 007 tie-in seeing as there are multiple Bond movies with sharks, but it's also not a match, and again, I wouldn't have gone for a third guess here when there were only two clued words.)

garretraziel · 2025-01-25T15:26:10 1737818770

This is allowed by the rules though. You can guess +1 to the number specified.

topaz0 · 2025-01-25T18:34:30 1737830070

They know it's allowed. It's also terrible and non-sensical strategy in the specific cases that are described.

pama · 2025-01-25T12:48:49 1737809329

I was wondering about the same. It is possible that the instructions didn’t try to make the gameplay as aggressive as possible. A good model could optimize the separator to make it easy to guess the most words possible. By having access to its own state, it should be possible to reach 5–6 words in most cases. There is an argument for keeping words around that would increase the difficulty of the opponents guessing large/clean separations, so it is possible that optimal play includes simple pairs on occasion. Very interesting application nonetheless.

vitus · 2025-01-25T13:40:40 1737812440

> It is possible that the instructions didn’t try to make the gameplay as aggressive as possible.

In case you're wondering, the prompts are available here: https://github.com/SuveenE/codenames-ai/blob/main/utils/prom...

pama · 2025-01-26T03:32:12 1737862332

Thanks!

wwtl12 · 2025-01-25T19:48:02 1737834482

The Mechanical Turk is super impressive if you don't know how it works.

captn3m0 · 2025-01-25T13:11:35 1737810695

I've been trying to do this with just word2vec, instead of throwing an LLM, since you just need to find a word with the appropriate distances optimized. https://github.com/captn3m0/ideas?tab=readme-ov-file#codenam...
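For readers unfamiliar with the idea, here is a minimal sketch of distance-based clue scoring. The 3-dimensional vectors are invented toy values, not real word2vec output, and the scoring rule (worst target similarity minus best similarity to a word to avoid) is just one plausible choice, not necessarily what the linked project does.

```python
import math

# Toy vectors invented for illustration; a real attempt would load
# pretrained embeddings (word2vec, GloVe, ...) instead.
TOY_VECTORS = {
    "paper": (0.9, 0.1, 0.0),
    "mail":  (0.8, 0.2, 0.1),
    "log":   (0.7, 0.0, 0.3),
    "king":  (0.1, 0.9, 0.1),
    "shark": (0.0, 0.2, 0.9),
    "ocean": (0.1, 0.1, 0.9),
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def score_clue(clue, targets, avoid, vectors=TOY_VECTORS):
    """Weakest link to a target minus strongest link to a word to avoid."""
    worst_target = min(cosine(vectors[clue], vectors[t]) for t in targets)
    best_avoid = max(cosine(vectors[clue], vectors[a]) for a in avoid)
    return worst_target - best_avoid

def best_clue(candidates, targets, avoid, vectors=TOY_VECTORS):
    """Pick the candidate clue with the highest score."""
    return max(candidates, key=lambda c: score_clue(c, targets, avoid, vectors))

print(best_clue(["paper", "ocean"], targets=["mail", "log"], avoid=["king", "shark"]))
```

Penalizing the closest word to avoid is the part that matters for Codenames: a clue near all your targets is useless if it sits even closer to an opponent's card or the assassin.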

qqqult · 2025-01-25T15:40:45 1737819645

I did that last summer. I compared the performance of different English word embedding models; as far as I remember, the best ones were GloVe and a few knowledge graph word embeddings.

None of them were better than a human at giving hints for 3+ words though

zeroonetwothree · 2025-01-25T15:03:48 1737817428

I tried this many years ago (before LLMs) with hundreds of real human games and it was never that good.

dartos · 2025-01-25T14:07:08 1737814028

I love this.

Imagine the energy savings if more people didn’t just automatically reach for LLMs for their pet projects.

JaggerFoo · 2025-01-22T07:55:38 1737532538

I did this with Claude over the holidays, putting Claude in the role of guesser and comparing its guesses to those of another experienced human player. It turns out they matched each other.

suveen_ellawela · 2025-01-22T17:19:43 1737566383

That's a nice experiment! I think Codenames could definitely be an evaluation method for LLMs.

pieix · 2025-01-25T12:50:19 1737809419

Elo on different card games/board games would be a great eval metric now that the systems are general enough to play Codenames, chess, poker…
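For reference, the standard Elo update is short enough to sketch. The K-factor of 32 and the starting ratings below are assumptions for illustration, and treating a Codenames match between two model teams as a single win/loss result is also an assumption, not something proposed in the thread.

```python
def expected_score(rating_a, rating_b):
    """Probability-like expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update(rating_a, rating_b, score_a, k=32):
    """Return new (rating_a, rating_b) after one game.

    score_a is 1.0 for an A win, 0.5 for a draw, 0.0 for a loss.
    k=32 is an assumed K-factor; rating systems pick their own.
    """
    e_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - e_a)
    return rating_a + delta, rating_b - delta

# Two hypothetical models start level; model A wins one game.
new_a, new_b = update(1500, 1500, 1.0)
print(new_a, new_b)
```

Because the update is zero-sum, the total rating in the pool stays constant, which makes ratings comparable across games only within the same pool of players.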

suveen_ellawela · 2025-01-25T23:38:00 1737848280

totally agree!

__MatrixMan__ · 2025-01-25T14:59:40 1737817180

It would be fun to build one, perhaps mediated by an app, where you have to guess whether your spymaster is a human or an AI based on the quality of their choices.

zeroonetwothree · 2025-01-25T15:05:48 1737817548

The average human is quite bad. It really works well when the spymaster is (a) experienced and (b) familiar with the other players.

__MatrixMan__ · 2025-01-25T15:21:58 1737818518

It's the (b) case I'm interested in. Like the spymaster loses if they can't subtly indicate to their friends that they're the real deal. Otherwise the robots win.

suveen_ellawela · 2025-01-25T23:39:41 1737848381

i thought of adding a feature where you can get your own spymaster. you can give it all your personal info and the clues would be customized. the bottleneck is that the other human spymaster has to help with updating the game state, because I (the guesser) can't look at the spymaster view.

jsemrau · 2025-01-25T15:40:50 1737819650

I have been doing some experiments with agents and reinforcement learning playing a 4x4 Tic Tac Toe game [1]. Given my analysis of the "thought" process, we are still really far from true understanding of such games. While in my game as well as OP's the rules are pre-trained and the models are good enough to reach a conclusion (which in itself is already impressive), there is still a long way to go.

[1] https://jdsemrau.substack.com/p/nemotron-vs-qwen-game-theory...

fercircularbuf · 2025-01-25T12:57:46 1737809866

I've intuitively felt that this general class of task is what these LLMs are absolutely best at. I'm not an expert on these things, but isn't this thanks to word embeddings and how words are mapped into high dimensional vector space within the model? I would imagine that because every word is mapped this way, finding a word that exists in the same area as mail, lawyer, log, and line in some vector space would be trivial for the model to do, right?

infinitifall · 2025-01-25T14:17:44 1737814664

More than just words. I've found LLMs immensely helpful for searching through the latent space or essence of quotes/books/movies/memes. I can ask things like "what's that book/movie set in X where Y happens" or "what's that quote by a P which goes something like Q" in my own paraphrased way and, with a little prodding, expect the answer. You'd have no luck with traditional search engines unless someone has previously asked a similar question.

simonw · 2025-01-25T16:32:49 1737822769

I've been trying out various "reasoning" models (o1, R1, Gemini Thinking etc) against the NYT Connections word puzzle - it's a really interesting test of them. So far o1 Pro has been the most consistently successful: https://www.nytimes.com/games/connections

topaz0 · 2025-01-25T18:44:10 1737830650

Wonder if they use LLMs to write those puzzles

lynguist · 2025-01-25T18:45:56 1737830756

I kinda have the same very subjective feeling where o1 is the first AI that is clearly superior to me.

some_random · 2025-01-25T19:42:18 1737834138

I don't find this remotely compelling. I can easily come up with clues that make sense to me to connect a ton of words; the difficulty is coming up with clues that others will look at the same way. The last example is exactly what I mean: "paper" makes sense for those 4 only when you explain it. If "Line" counts then why not "Gum" (which is typically wrapped in paper), or if "Lawyer" is valid then why not "King" (whose decrees are written on what?).

raphael1234 · 2025-01-25T15:35:51 1737819351

The recent DeepSeek model can also be very good at this kind of board game thanks to its strong reasoning, although privacy is a concern; see https://askflow.com/share/AI_community_reaction_DeepSeek__0a...

macromaniac · 2025-01-25T16:38:23 1737823103

A few years back I made one where you play with the AI instead of AI v AI, but never posted it anywhere. If anyone wants to try, I just updated it to gpt-4o-mini: https://wordswithrobots.isotropic.us/

blakeburch · 2025-01-25T16:51:07 1737823867

Love the idea! Just wish you could specify a number like you do in Codenames. Otherwise, it just keeps going until all of its options are wrong.

macromaniac · 2025-01-25T17:13:09 1737825189

True, because then it feels more intentional (+ the extra strategy). It was definitely a bit thrown together; atm I only ever use it when I need a bit of practice before playing Codenames.

suveen_ellawela · 2025-01-25T23:42:15 1737848535

cool stuff!

thrance · 2025-01-25T11:56:22 1737806182

I mean, it's playing against itself, which is not really a fair comparison to humans in my mind. The fun and hard part of this game is to get into your teammates' brains and decipher what they possibly meant with what they played.

suveen_ellawela · 2025-01-25T23:47:01 1737848821

yea, didn't mean to take the fun out of the original game.

the idea for this came when we asked chatgpt how to connect the words 'carrot' and 'ray'. maybe you can give it a try too!

thrance · 2025-01-26T00:04:51 1737849891

I still enjoyed reading your post, it's fun and interesting!

Maybe one could try having two different models play together, to see if they are genuinely good at the game or simply able to infer their own reasoning, if that makes sense.

I'm kinda bad at word games like Codenames, even in my native language (French). With carrot and ray, I'd try something like "striation"? But it's really convoluted.

joaomacp · 2025-01-25T11:35:00 1737804900

I tried whatever the multi-modal paid ChatGPT model is on the Codenames Pictures version, and it didn't fare that well. Since they will probably scrape this comment and add it to the next model's training data, I look forward to it getting good!

suveen_ellawela · 2025-01-25T23:42:49 1737848569

haha!

sylware · 2025-01-25T14:03:38 1737813818

If it can port C++ to C99+ and write correct 64-bit RISC-V assembly...

tweakimp · 2025-01-25T13:22:00 1737811320

It would be really interesting to see an LLM watch other players and learn how they think to find the best clues THEY need to hear to find the right words.

suveen_ellawela · 2025-01-25T23:49:25 1737848965

definitely an interesting approach. I started writing down gameplays when i play with friends. then eventually stopped to enjoy the moment.

Amekedl · 2025-01-25T17:14:28 1737825268

“o1 is more knowledgeable than the average human”

“the toyota yaris can move faster than the average human”

even opt-125m from years ago can pull more facts than the average human.

jerkstate · 2025-01-25T16:15:48 1737821748

You can pretty reliably get 2-clues and sometimes good 3-clues just using word2vec embedding similarity

suveen_ellawela · 2025-01-25T23:50:06 1737849006

agree. i think getting to have a look at o1's reasoning was pretty fun.

bongodongobob · 2025-01-25T17:40:19 1737826819

I played it with 3.5 and it was great. This isn't something o1 just picked up on.

suveen_ellawela · 2025-01-25T23:44:35 1737848675

yep, agree. One big part of the experiment was to see how well it does the reasoning by asking it to output the reasoning.

yantrams · 2025-01-25T18:04:06 1737828246

I cracked myself up with a ridiculous train of thought for fun while playing Codenames once. It went a little something like this

Star => Twinkle => Twinkle Khanna => Married to Akshay Kumar => Canadian Citizen => Maple Syrup ( Leaf ? )

suveen_ellawela · 2025-01-25T23:51:07 1737849067

haha, i've been in similar situations, but this one's something else.

progrus · 2025-01-25T13:38:44 1737812324

GPT-3 was superhuman at this too

suveen_ellawela · 2025-01-25T23:48:28 1737848908

yep, agree. One big part of the experiment was to see how well it does the reasoning by asking it to output the reasoning.

jinyang0220 · 2025-01-25T20:13:38 1737836018

Dude I looooooooooved that game. How long did u spend building it?

suveen_ellawela · 2025-01-25T23:28:17 1737847697

2 weeks!

badgersnake · 2025-01-25T14:41:49 1737816109

Or just play with your friends?

suveen_ellawela · 2025-01-25T23:45:02 1737848702

my friends were bad clue givers. i just had to switch to ai.

badgersnake · 2025-01-26T11:35:44 1737891344

That’s part of the fun, though.

tsroe · 2025-01-25T11:39:54 1737805194

Fun quirk about this game: If there aren't too many cards left and your teammate knows their powers of two, you have a winning strategy. You simply lay a mental bitmap over all remaining cards, setting 1 for cards that belong to your team and 0 for all others. You can then just say the number that is represented by this bitmap, e.g. "five" for 0101, and your teammate can decode it in their head. All numbers are, after all, single words. This means, if you are very good at mental maths or you allow for a calculator, you could also win every game in the first round. For me personally however, it only becomes feasible with around 10 cards remaining.
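Purely to illustrate the encoding described above, here is a small sketch of the bitmap round trip. The `encode` and `decode` names and the left-to-right, most-significant-bit-first ordering are assumptions; both players would have to agree on the card order beforehand.

```python
def encode(mine):
    """Turn a list of booleans (leftmost card first) into an integer.

    True means the card belongs to your team; the leftmost card is the
    most significant bit.
    """
    number = 0
    for is_mine in mine:
        number = (number << 1) | int(is_mine)
    return number

def decode(number, num_cards):
    """Recover the per-card booleans from the spoken number."""
    return [bool((number >> (num_cards - 1 - i)) & 1) for i in range(num_cards)]

# Four cards left, second and fourth are ours: bitmap 0101.
cards = [False, True, False, True]
clue_number = encode(cards)
print(clue_number)                 # 5
print(decode(clue_number, 4))      # [False, True, False, True]
```

With 10 cards remaining the number can be as large as 1023, which matches the comment's point that the mental arithmetic gets hard quickly.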

RedNifre · 2025-01-25T11:46:41 1737805601

That's against the rules.

tweakimp · 2025-01-25T13:15:21 1737810921

What if the game showed a different order of cards to every player?

wccrawford · 2025-01-25T19:27:41 1737833261

Because the original was a tabletop game, it can't.

The digital version could and should do this, IMO. (I don't actually know if it does, though, as I've only played the digital version a few times.)

Klaster_1 · 2025-01-25T11:51:58 1737805918

Guys I was playing with declared a similar move against the rules, so it was back to the old latent space search.

Smaug123 · 2025-01-25T11:56:38 1737806198

It is explicitly against the rules (https://czechgames.com/files/rules/codenames-rules-en.pdf), so they were correct. "Your clue must be about the meaning of the words. You can't use your clue to talk about the letters in a word or its position on the table."

andrepd · 2025-01-25T13:05:51 1737810351

This is explicitly against the rules.

suveen_ellawela · 2025-01-25T23:47:59 1737848879

goated.