The simplified version is, Google sends the browser a one-time key, which the browser forwards to the HW token to sign with its private key. Then the browser sends this back to the web server to verify, using its copy of the HW token's public key.
This would be vulnerable to MITM attacks, as you say.
So what the protocol actually does is concatenate the nonce sent by the web server with the origin of the web page as seen by the browser and have the HW token sign that. This way the server can verify that the HW token signed the right nonce for the right origin.
See https://docs.google.com/document/d/1SjCwdrFbVPG1tYavO5RsSD1Q..., search for "origin".