Edit (too late to edit my original comment): Reading through The Pile paper, they define public data as such:
> Public data is data which is freely and readily available on the internet. This primarily excludes data which is pay-walled (regardless of how easy that paywall is to bypass) and data which cannot be easily obtained but can be obtained, e.g. through a torrent or on the dark web.
This should disqualify books3, but they use it to justify books3.
They need to extend that definition transitively. Maybe that’s my frustration. If I collect data from torrents and make it more freely and readily available, then it meets their definition of public data, which is basically what books3 is. If they had included it themselves, it wouldn’t meet their definition, but because someone else redistributed it first, it’s more okay for them to redistribute?
They've also scraped HackerNews posts. Since I posted blog links to HackerNews, does that mean they stole all of my blog posts? That represents three years of work and the chapters of 3 books that I intend to publish. They just took it and will start delivering it to their users to help them write more interesting content? Not okay.