A checkbox would make no difference in the case where the uploader has not been granted the right to agree to such terms with respect to the code in question, leaving the matter in the same situation.
I'd like to propose a "golden rule" test. If Copilot were truly not at risk of regurgitating large blocks of code verbatim, why didn't Microsoft train it on proprietary Microsoft code as well? Why was it limited to user-submitted code on GitHub? If there is any argument pointing to licensing, copyright, or patents, it stands to reason those concerns would apply to any corpus of user-submitted code, since users could easily misrepresent the licensing.
Precisely. I made this exact point a while ago: why didn't Microsoft submit the source code of Windows as part of the training data? That, at least, is code they can plausibly claim they have the rights to.
There's a good point made over here: https://news.ycombinator.com/item?id=34282407 — the model might accidentally spit out some kind of secret that is much smaller than a copyrightable piece of code. That's not a risk for code in public repositories.