I am embarrassed to say that I am confused about what the outputs are from the "Rapid" RL training system. Do you end up with an executable that then drives the game inputs/api? Does it produce a "bot script" that is used by the game to drive the logic? I understand that thousands of CPUs/GPUs are used for the training, but then what is actually playing the game at the end of the day?
Hi Gdb, next week I am giving a presentation on your awesome Dota work to the local data science community in Vancouver, BC. I have reviewed the info your team has released so far and I have a few questions:
- I saw no mention of CNNs; is it true that CNNs are not used even for the 8x8 terrain grid input?
- do you have any comments about Rapid+PPO vs., say, IMPALA+V-trace? Would the ability to use more off-policy data be very helpful here?
- any comments on how you selected the reward constants?
- was the teamwork tau something your team came up with, or was it a known approach?
- the attention keys are most interesting; can you comment on why they don't flow through the LSTM? Does it make it easier for the network to quickly change unit attention, or is there some other reason?
- any comment on the choice of a single-layer LSTM vs. a multilayer one, ostensibly for operating on longer timescales?
- does this result mean that HRL is less critical than some people thought?
- any comment on the magnitude of compute, like in the post from May?
Could you go into some more detail on the actual engineering mechanics? Does each bot have an instance of the neural net model that it runs on a separate PC? How often do you feed game state into the net? What's the output of the network (a bunch of movement / item / spell commands?) that is fed back in through the game driver?
Oh, good question, I didn't think of that either. Is there one NN that consumes the state for each of the bot players and then returns the "next action" for that bot, or is there a separate NN for each of the bots? And does that NN run on the LAN machine, or is the LAN machine just running the game code and a Python agent mediating between the game code and the NN?
We dump state from the bot API each tick and send it over gRPC to a Python agent, which formats the state into a tuple of NumPy arrays. That tuple is passed into 5 neural networks (one per agent), each of which returns a tuple of NumPy arrays. Each tuple is decoded into a semantic action, which is then returned to the game via gRPC.
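To make the data flow concrete, here's a minimal sketch of that per-tick loop. Everything here is hypothetical: the real system sends state over gRPC and runs large LSTM policies, whereas this stub replaces the transport with plain function calls and each "network" with a random linear map, just to show the shape of state-in / five-semantic-actions-out.

```python
import numpy as np

N_AGENTS = 5     # one network per hero
STATE_DIM = 16   # placeholder observation size
ACTION_DIM = 4   # placeholder action-head size

rng = np.random.default_rng(0)
# Stand-in "networks": one random linear map per agent.
weights = [rng.standard_normal((STATE_DIM, ACTION_DIM)) for _ in range(N_AGENTS)]

def format_state(raw_tick):
    """Format raw per-agent game state into a tuple of NumPy arrays."""
    return tuple(np.asarray(obs, dtype=np.float32) for obs in raw_tick)

def forward(agent_idx, obs):
    """Placeholder for one agent's neural net: takes an observation array,
    returns a tuple of output arrays (here, a single vector of logits)."""
    return (obs @ weights[agent_idx],)

def decode(outputs):
    """Decode a network-output tuple into a semantic action the game
    understands (here, just an argmax action id)."""
    (logits,) = outputs
    return {"action_id": int(np.argmax(logits))}

def step(raw_tick):
    """One tick: raw state in, one semantic action per agent out.
    In the real system the input and output legs are gRPC calls."""
    states = format_state(raw_tick)
    return [decode(forward(i, s)) for i, s in enumerate(states)]

raw_tick = [rng.standard_normal(STATE_DIM) for _ in range(N_AGENTS)]
actions = step(raw_tick)
print(len(actions))  # 5: one semantic action per agent
```

The point of the sketch is the separation of concerns: the game side only ever sees raw state and semantic actions, while the Python agent owns the array formatting and the per-agent forward passes.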