How can that be done? That's between my phone and Google, so how can they "listen in" on that?
In the simplified version, Google sends the browser a one-time challenge (a nonce), which the browser forwards to the HW token to sign with its private key. The browser then sends the signature back to the web server, which verifies it using its copy of the HW token's public key.
On its own, that scheme would be vulnerable to MITM attacks, as you say: a phishing page could relay the real server's nonce and pass the resulting signature along.
So what the protocol actually does is concatenate the nonce sent by the web server with the origin of the web page as seen by the browser and have the HW token sign that. This way the server can verify that the HW token signed the right nonce for the right origin.
See https://docs.google.com/document/d/1SjCwdrFbVPG1tYavO5RsSD1Q..., search for "origin".
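To make the origin binding concrete, here is a rough Python sketch of the idea, not the actual U2F wire format. Everything in it is an assumption made for illustration: the function names, the example origins, and especially the use of HMAC as a stand-in for the token's signature (a real token uses an asymmetric key pair, and the server only ever holds the public key).

```python
import hashlib
import hmac
import secrets

# Stand-in for the token's key. A real HW token signs with a private key and
# the server verifies with the public key; HMAC only keeps this sketch
# runnable with the standard library.
TOKEN_KEY = secrets.token_bytes(32)


def token_sign(message: bytes) -> bytes:
    """Hypothetical token operation: sign whatever the browser hands over."""
    return hmac.new(TOKEN_KEY, message, hashlib.sha256).digest()


def signed_payload(nonce: bytes, origin: str) -> bytes:
    """Concatenate nonce and origin; the length prefix keeps it unambiguous."""
    return len(nonce).to_bytes(2, "big") + nonce + origin.encode("utf-8")


def browser_login(nonce: bytes, origin_seen_by_browser: str) -> bytes:
    """The browser, not the page, supplies the origin that gets signed."""
    return token_sign(signed_payload(nonce, origin_seen_by_browser))


def server_check(nonce: bytes, signature: bytes, expected_origin: str) -> bool:
    """The server rebuilds the payload with its own origin and verifies."""
    expected = token_sign(signed_payload(nonce, expected_origin))
    return hmac.compare_digest(expected, signature)


REAL_ORIGIN = "https://accounts.example.com"
PHISHING_ORIGIN = "https://accounts.example.evil"

nonce = secrets.token_bytes(16)

# User is on the real site: the signed origin matches what the server expects.
direct_ok = server_check(nonce, browser_login(nonce, REAL_ORIGIN), REAL_ORIGIN)

# User is on a phishing page relaying the real nonce: the browser reports the
# phishing origin, so the relayed signature fails at the real server.
relayed_ok = server_check(nonce, browser_login(nonce, PHISHING_ORIGIN), REAL_ORIGIN)

# A signature for an old nonce does not verify against a fresh one.
replay_ok = server_check(
    secrets.token_bytes(16), browser_login(nonce, REAL_ORIGIN), REAL_ORIGIN
)

print(direct_ok, relayed_ok, replay_ok)
```

The point of the sketch is that the man in the middle can forward the nonce but cannot choose the origin, because the browser fills that in itself.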
So… whichever login attempt reaches the confirmation stage last wins (not relevant in this situation), and the confirmation screen on (at least) my phone does not indicate anything regarding location (which is highly relevant).
This looks a little weaker than TOTP (you're basically trading a little security for the convenience of not entering a code while keeping the second factor) and a lot weaker than U2F.