Using that which belongs to others without their consent is theft. There isn't much to debate, unless of course you wish to benefit from that theft. Powerful corporations can train whatever they wish on the data they own. For instance, Microsoft could train its bots on Microsoft's own source code instead of other people's code, but that is not going to happen, because they are aware of the implications - these procedural generators are exactly that and nothing more. Meanwhile, we can't use their products without a license.
So if they decide to play this game, then so should we.
> Using that which belongs to others without their consent is theft
Are text snippets, thumbnails, and site caches shown by search engines (on an opt-out basis) "theft"? If you draw a car, which you can do because you have seen many individually copyrighted car designs, are you stealing from auto manufacturers? Have I just committed theft by using a portion of your comment above as a quote?
I'm not claiming that statistical model fitting inherently needs to be treated the same as the examples above; rather, I use them to show that the bar of "using" is far too broad.
Legally, copyright infringement in the US requires that the works are substantially similar and not covered by Fair Use. Morally, I believe that artificial scarcity, such as the evergreening of medical patents, is detrimental and should be prevented wherever feasible - and I wouldn't call any kind of copying/sharing/piracy "theft". The digital equivalent of theft is, for example, account theft, where you actually remove the object from the owner's possession.
Theft is the act of taking away someone else's property. "Using" (aka copying) the public data I create isn't theft, be it with my consent or without. It may be copyright infringement under certain conditions, but arguing that this infringement is stealing is like arguing that digital piracy and shoplifting are basically the same thing.