Is this an accurate representation of the GPT driver loop?
def generate(prompt: str) -> str:
    # Transform a string into a list of tokens.
    tokens = tokenize(prompt)  # tokenize(prompt: str) -> list[int]
    while True:
        # Run the model.
        # Returns the next token's probability distribution:
        # a list of 50257 floats that sum to 1.
        candidates = gpt2(tokens)  # gpt2(tokens: list[int]) -> list[float]
        # Select the next token from the candidate distribution.
        next_token = select_next_token(candidates)
        # select_next_token(candidates: list[float]) -> int
        # Append it to the list of tokens.
        tokens.append(next_token)
        # Decide whether to stop generating. The criterion could be a
        # token count, a timeout, a stop word, or something else.
        if should_stop_generating():
            break
    # Transform the list of tokens back into a string.
    completion = detokenize(tokens)  # detokenize(tokens: list[int]) -> str
    return completion
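
For context, here is one plausible implementation of select_next_token. This is a sketch of my own, not code from any particular library: simple decoders either take the argmax (greedy decoding) or sample from the distribution, as below.

import random

# Illustrative sketch (my assumption, not part of the pseudocode above):
# sample the next token in proportion to its probability.
def select_next_token(candidates: list[float]) -> int:
    # candidates[i] is the probability of token i; random.choices picks
    # one index weighted by those probabilities.
    return random.choices(range(len(candidates)), weights=candidates, k=1)[0]

# Greedy decoding would instead be:
#   next_token = max(range(len(candidates)), key=candidates.__getitem__)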
I ask because that loop looks a lot like a state machine implementing Shlemiel the painter's algorithm: every call to gpt2 re-processes the entire, ever-growing token list from scratch, which throws doubt on the intrinsic compute cost of the generative exercise.
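
To make that concern concrete, here is a back-of-the-envelope sketch (my own illustration, with made-up names) of how many tokens the model would process in total if nothing were cached between iterations:

def total_tokens_processed(prompt_len: int, steps: int) -> int:
    # Step i re-feeds the prompt plus the i tokens generated so far,
    # so the total is quadratic in the number of steps: classic Shlemiel.
    return sum(prompt_len + i for i in range(steps))

print(total_tokens_processed(100, 10))    # 1045
print(total_tokens_processed(100, 100))   # 14950
print(total_tokens_processed(100, 1000))  # 599500

If the loop really works this way, generating n tokens costs O(n^2) token evaluations.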
I think the "context window" that people refer to with large language models means there's a maximum number of tokens that are retained, with the oldest being discarded. The window is a sliding window.