
You (or software you use to interface with ChatGPT) need to maintain the state of the game between moves. LLMs aren’t computers, but they can interact with computers.

LLM(InitialInstructions)->Computer(CodeWrittenByLLM)->LLM(InstructionsOutputByComputer)->LoopUntilWin
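
A minimal driver for that loop might look like the sketch below. Note that callLLM and runGeneratedCode are hypothetical placeholders for an LLM API client and a sandboxed JS interpreter, not real libraries:

  // Sketch of the pipeline above. callLLM and runGeneratedCode are
  // hypothetical stand-ins for an API client and a sandboxed eval.
  async function playUntilWin(initialInstructions) {
    let prompt = initialInstructions;
    while (true) {
      const code = await callLLM(prompt);      // LLM writes code
      const state = runGeneratedCode(code);    // computer executes it
      if (state.winner) return state;          // LoopUntilWin
      prompt = state.instructionsForLLM;       // computer output back to LLM
    }
  }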




That's interesting! But it does show that the hype around ChatGPT is misplaced: as impressive as it is (and I do find it impressive), it doesn't really build a "model" of the conversation; you have to help it all the way or it will go astray, since nonsensical board game moves make as much sense to it, from a conversational point of view, as sensible ones. It's also easy to make it hallucinate nonfactual information, which makes it bad at exploring topics you're unsure about, where you could inadvertently write misleading questions. (Note this isn't the same as asking misleading questions of a human, who will answer confidently out of arrogance; ChatGPT has no "ego", but it will write completely false/nonfactual answers if asked to by mistake. I have examples of this.)

It's easy to get confused about GPT's limitations because it's a pretty successful parrot, and it writes convincing conversations in a vast number of cases.


In my experiments that doesn’t help. ChatGPT failed to play any valid move given a position, let alone a good one. The point is that language is probably too “linear” to represent what’s going on during a game. Pieces on the board have complex relationships (consider a pinned piece, for example) so autoregressive decoding is simply not enough.


Try something like:

  The current board state is:
  
  board = [['', '', ''],
           ['', '', ''],
           ['', '', '']];

  write a javascript function called bestMove(board) that predicts the best tic-tac-toe move to make given a board. use that function to update the board state and return the board state in JSON form.
The response will have a bunch of functions like

  function bestMove(board) {
  function getEmptySpaces(board) {
  function predictBestMove(board, player) {
  function minimax(board, isMaximizing) {
  function checkWinner(board) {
  ...
  These functions should work together to determine the best move to make in a game of Tic-Tac-Toe, using the minimax algorithm to evaluate each possible move and choosing the one with the highest score.
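
For concreteness, here is a hand-written version of what those functions usually end up looking like. This is hypothetical; ChatGPT's actual output varies from run to run:

  // Hypothetical completion of the skeleton ChatGPT returns.
  function getEmptySpaces(board) {
    const spaces = [];
    for (let r = 0; r < 3; r++)
      for (let c = 0; c < 3; c++)
        if (board[r][c] === '') spaces.push([r, c]);
    return spaces;
  }

  function checkWinner(board) {
    const lines = [
      [[0,0],[0,1],[0,2]], [[1,0],[1,1],[1,2]], [[2,0],[2,1],[2,2]], // rows
      [[0,0],[1,0],[2,0]], [[0,1],[1,1],[2,1]], [[0,2],[1,2],[2,2]], // columns
      [[0,0],[1,1],[2,2]], [[0,2],[1,1],[2,0]]                       // diagonals
    ];
    for (const [[a,b],[c,d],[e,f]] of lines) {
      const v = board[a][b];
      if (v !== '' && v === board[c][d] && v === board[e][f]) return v;
    }
    return getEmptySpaces(board).length === 0 ? 'tie' : null;
  }

  function minimax(board, isMaximizing) {
    const winner = checkWinner(board);
    if (winner === 'X') return 1;   // X is the maximizing player
    if (winner === 'O') return -1;
    if (winner === 'tie') return 0;
    let best = isMaximizing ? -Infinity : Infinity;
    for (const [r, c] of getEmptySpaces(board)) {
      board[r][c] = isMaximizing ? 'X' : 'O';
      const score = minimax(board, !isMaximizing);
      board[r][c] = '';
      best = isMaximizing ? Math.max(best, score) : Math.min(best, score);
    }
    return best;
  }

  function bestMove(board) {
    let move = null, bestScore = -Infinity;
    for (const [r, c] of getEmptySpaces(board)) {
      board[r][c] = 'X';                   // try X here
      const score = minimax(board, false); // then O plays optimally
      board[r][c] = '';
      if (score > bestScore) { bestScore = score; move = [r, c]; }
    }
    return move;                           // [rowInt, colInt]
  }
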
Then eval and execute the bestMove function, passing in the initial board state, returning the updated board state. Then the human player makes a move.
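
The eval step can be done with Node's built-in vm module; generatedCode below is assumed to hold the text of the functions ChatGPT returned. Note that node:vm is not a real security boundary, so a production harness would need more care with untrusted model output:

  // Run the model-generated code in an isolated context.
  const vm = require('node:vm');
  const context = { board: [['', '', ''], ['', '', ''], ['', '', '']] };
  vm.createContext(context);
  vm.runInContext(generatedCode, context);    // defines bestMove() and friends
  const [r, c] = vm.runInContext('bestMove(board)', context);
  context.board[r][c] = 'X';                  // apply the predicted move
  console.log(JSON.stringify(context.board)); // hand the updated state back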

Then another prompt:

Try something like:

  The current board state is:
  
  board = [['X', '', ''],
           ['', 'O', ''],
           ['', '', '']];

  assume there is a function called bestMove(board) and checkWinner(board) that predicts the best tic-tac-toe move to make given a board. use those functions to update the board state and check the winner and return the board state and current winner in JSON form.
etc...
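
The code ChatGPT returns for this second prompt is typically just glue over the assumed functions, roughly like this (hypothetical output; X is taken as the player to move):

  // Predict a move, apply it for the player to move, then report
  // the board and the winner as JSON.
  const [row, col] = bestMove(board);
  board[row][col] = 'X';
  console.log(JSON.stringify({ board: board, winner: checkWinner(board) }));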

Using my little engine, I get this solution:

  question: "Answering as [rowInt, colInt], writing custom predictBestMove, getEmptySpaces, minimax and checkWinner functions implemented in the thunk, what is the best tic-tac-toe move for player X on this board: [['X', '_', 'X'], ['_', '_', '_'], ['_', '_', '_']]?",
  answer: [ 0, 1 ],
https://gist.github.com/williamcotton/e6bdcca0a96a6e7bf5d2fe...


Interesting. Have you tried playing a full game like this, instead of a single move?

In any case, I don't think this is what people expect out of ChatGPT. Your approach is too "programmer centric". I think people expect telling ChatGPT the rules of the game, in almost plain language, and then expect to be able to play a game of Tic Tac Toe interacting with it like one would with a person. This means not asking it to write functions or reminding it of the state of the board at every step.

This doesn't work consistently for a well-known game like Tic Tac Toe, much less for an arbitrary game you make up.


> Interesting. Have you tried playing a full game like this, instead of a single move?

No, but it is correctly running the best-move functions, so by induction we can see it will successfully play a full game.

> I think people expect telling ChatGPT the rules of the game, in almost plain language, and then expect to be able to play a game of Tic Tac Toe interacting with it like one would with a person.

This is an unreasonable expectation for a large language model.

When a person computes the sum of two large numbers they do not use their language facilities. They probably require a pencil and pad so they can externalize the computational process. At the very least they are performing calculations in their head in a manner very different from the cognitive abilities used when they catch a ball.

Try playing a game like Risk without a board or pieces, that is, without a concrete mechanism to maintain state.

This approach isn’t cheating; an LLM acting as a translator is a key component. This doesn’t “prove that LLMs are useless bullshit generators, snicker snicker” because they can’t maintain state or do math very well; it just means you need to use other existing tools to do math and maintain state… like JS interpreters.

One thing I expect to improve: a larger-scale language model should need fewer internally specific terms in the prompt to reliably get the same results.

Also, translations are necessarily lossy and somewhat arbitrary, so these results need to be considered probabilistically as well. Meaning: generate 10 different thunks and have them vote on the answers they compute.
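
A sketch of that voting idea, with generateThunk and runThunk as hypothetical helpers wrapping the LLM call and the sandboxed eval:

  // Generate n independent thunks, run each, and take the majority answer.
  async function votedAnswer(question, n = 10) {
    const tally = new Map();
    for (let i = 0; i < n; i++) {
      const thunk = await generateThunk(question);  // fresh LLM translation
      const key = JSON.stringify(runThunk(thunk));  // canonicalize for counting
      tally.set(key, (tally.get(key) || 0) + 1);
    }
    // The most frequent answer across the n runs wins the vote.
    const [winner] = [...tally.entries()].sort((a, b) => b[1] - a[1])[0];
    return JSON.parse(winner);
  }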


> No, but it is correctly running the best-move functions, so by induction we can see it will successfully play a full game.

I'm not convinced induction applies. ChatGPT tends to "go astray" in conversations where it needs to maintain state; even with your patch for this (essentially reminding it what the state is at every prompt), I would test it just to make sure it can run a game through to completion, make good moves all the way, and tell when the game is over.

I can make ChatGPT do single "reasonable" moves, the problem surfaces during a full game.

> This is an unreasonable expectation for a large language model.

Yes, but enough people hold it anyway that it is a concern. And it's made worse because in some contexts ChatGPT fakes this quite effectively!


> I'm not convinced induction applies. ChatGPT tends to "go astray" in conversations where it needs to maintain state; even with your patch for this (essentially reminding it what the state is at every prompt), I would test it just to make sure it can run a game through to completion, make good moves all the way, and tell when the game is over.

You don't seem to understand what I am saying. ChatGPT cannot maintain state in a way that would be useful for playing a game. You must use a computer to interface with ChatGPT, like, via an API. And whatever program is calling ChatGPT needs to maintain the state of the game and can be used to iteratively call GPT.

So by induction once we know that the bestMove function is correct, which we have seen, we know that it will work at the start of any game and work until the game is finished.

I am definitely not talking about firing up the ChatGPT web user interface and trying to get it to magically maintain state.

> Yes, but enough people hold it anyway that it is a concern.

Some people hold this expectation because of a consistent barrage of straw man arguments, marketing hype, and fanboy gushing.

> And it's made worse because in some contexts ChatGPT fakes this quite effectively!

It turns out that a surprising number of computational tasks can be achieved by language models, but that is not because they are doing actual computations. They are not reliable computers at all. I don't know where this misconception came from, and from what I can tell this has been known for years. No one has ever hidden this fact, and solutions that resort to external computation have been part of published research for many moons now.

The problem is that most people just want to read clickbait and emote to score fake internet points and they don't want to put in the effort to actually learn about new things.


We seem to be talking at cross purposes. I understand (at a very high level) what LLMs do, and I don't think they can do actual computation.

Why do you insist on things I've already said I understand? I know ChatGPT is not good at maintaining state -- though it can fake it convincingly (which, understandably, seems to trip people up). I think it looks at your chat history within the session in order to generate the next response, which is why it can "degenerate" within a single session (but it's also how it can fake keeping state, by looking at the whole history before each reply).

I don't understand the rest of your answer. You seem to be really upset at "the people".

PS:

> So by induction once we know that the bestMove function is correct

"By induction", nope. Prove it. Run an actual full game instead of arguing with me. It will take you shorter to play the game than to debate with me.


What's the difference between keeping state and looking at the chat history?

Keeping state is something a human would have to do, because for a human, it would be very tedious and slow to re-read the history to recover context, relative to the timeliness expectation of the interlocutor.


> What's the difference between keeping state and looking at the chat history?

That's an excellent question. I don't know. Intuitively, looking at the chat history would seem a way to keep history, right?

However, in my tests trying to play Tic Tac Toe (informally, not using javascript functions as in the comment I was replying to), ChatGPT constantly failed. It claims to know the rules of Tic Tac Toe, yet it repeatedly forgets past board positions, making me think it's not capable of using the chat history to build a model of the game.


Like, we could both be thinking and talking about things like, “I wonder which programming languages are better or worse for these tasks? Is it harder to translate to NASM or ARM64? Or C? Or Lisp? Which Lisp performs better? What’s the relationship between training data and programming languages, and is this separate from an inherent complexity of a programming language? Can we use LLMs to make objective measurements about programming language complexity?”

I have done a little bit of testing, and LLMs are objectively worse at writing ASM than JavaScript. That makes sense: ASM is closer to the metal, and transcribing into functional ASM requires knowledge of the complexities of a specific CPU and the calling conventions of a specific OS, while JavaScript is closer to natural language, so there’s less “work” in the translation task.

But no, instead you want to prove to me that ChatGPT is some parlor trick…


> But no, instead you want to prove to me that ChatGPT is some parlor trick…

Excuse me, what?

I'm sorry, I've zero interest in discussing NASM or Lisp or whatnot. This was about the limitations of ChatGPT, not whatever strikes your fancy.



