I looked at the code, and it was straightforward: the "add number" view had a check for "ability to add number", and so did the homepage. The homepage would redirect you to the "add number" view if you could have a number but didn't. The "add number" view would redirect you to the homepage if you couldn't have a number, with the message "you can't change your number". Since the two checks are complementary, there was no way for a redirect to happen more than once.
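To make "complementary" concrete, here is a minimal sketch of how such a pair of views might look. It is not the actual code: the session layout, the `can_add_number` key, the paths and the `follow` helper are all hypothetical, and a redirect is modelled as simply returning the target path.

```python
# Hypothetical sketch: both views consult the same predicate on the session.
def can_add_number(session):
    # Assumed session key; the real check is not shown in the story.
    return session.get("can_add_number", False)

def homepage(session):
    # Redirect to "add number" if the user could have a number but doesn't.
    if can_add_number(session):
        return "/add-number"
    return None  # None means the page renders

def add_number(session):
    # Redirect home ("you can't change your number") otherwise.
    if not can_add_number(session):
        return "/"
    return None

VIEWS = {"/": homepage, "/add-number": add_number}

def follow(path, session, limit=10):
    """Follow redirects from `path` and return the list of pages visited."""
    visited = [path]
    while len(visited) < limit:
        target = VIEWS[path](session)
        if target is None:
            break
        path = target
        visited.append(path)
    return visited
```

As long as both views read the same session, `follow` never returns more than two entries: whichever view redirects, the other one renders.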
Because this is a privacy-focused service, logs were minimal: basically only the path of the request, the time, and the backend that served it. We managed to at least see the path the affected requests took and, indeed, they were bounced multiple times between the two pages.
Nothing anywhere hinted at why. The pages used complementary checks on data from the session, the session was stored in a central Redis cache, and all workers were running the same code.
The bug remained elusive until I noticed that the requests would be served first by worker 1, then by worker 2, then by 1 again, and so on, until the cycle broke when a request for page 1 was served by worker 2. This could only mean that the workers couldn't agree on the result of the check and, sure enough, the configuration on one of the hosts had mistakenly pointed the cache to local memory rather than Redis.
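The failure mode is easy to reproduce in a toy model. The sketch below is an illustration under assumptions, not the real setup: the stores are plain dicts (one standing in for Redis, one for the misconfigured worker's local memory), "page 1" is taken to be the homepage, worker 2 is assumed to be the misconfigured one, and the `Worker` class, `trace` helper and session key are hypothetical.

```python
from itertools import cycle

# Stand-in for the shared Redis cache: the session says the user can add a number.
shared_store = {"sid": {"can_add_number": True}}
# Stand-in for the misconfigured worker's local memory: it never sees that session.
local_store = {}

class Worker:
    def __init__(self, name, store):
        self.name = name
        self.store = store

    def handle(self, path, sid):
        """Return the redirect target, or None if the page renders."""
        session = self.store.get(sid, {})
        can_add = session.get("can_add_number", False)
        if path == "/" and can_add:
            return "/add-number"
        if path == "/add-number" and not can_add:
            return "/"
        return None

def trace(workers, sid, path="/", limit=10):
    """Send each successive request to the next worker and record (worker, path)."""
    hops = []
    for worker in workers:
        if len(hops) >= limit:
            break
        hops.append((worker.name, path))
        target = worker.handle(path, sid)
        if target is None:
            break
        path = target
    return hops

w1 = Worker("worker1", shared_store)
w2 = Worker("worker2", local_store)

# Strict alternation never settles; it only stops at the hop limit.
looping = trace(cycle([w1, w2]), "sid", limit=6)
# The loop ends once the homepage request lands on worker 2.
broken = trace([w1, w2, w1, w2, w2], "sid")
```

With both workers reading the shared store, the same trace stops after at most two hops, which is why the bug never showed up on a correctly configured host.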
That was a pretty interesting bug.
