About the training data: can't the datasets from the Tulu 3 model by the Allen Institute for AI be used?
They claim to have used a fully open-source training dataset.
My gut says a lot of attention needs to be given to building a community that focuses on open and reliable access to clean training data.
If a collective/co-op of individuals and organizations with storage and network capacity could collaborate to archive and index deduplicated training data, that would be huge.
Perhaps this is already happening. I was looking at RedPajama last year as an example.
Someone like me could arrange to host 200+ TB on high-speed storage with a public IP on a 10G link, for example. Get a bunch of us together and, in an ideal setup, access to training datasets would be decentralized and uncensored.
Is all that in progress and I just need to learn how to join?
Is RedPajama something to look at again?
Is anyone tracking datasets in detail, the way Hugging Face catalogs models? I know a lot of datasets are on it too, but there is massive duplication.
It might need to involve torrents or anonymity platforms to avoid the problems Books3 ran into, since some jurisdictions restrict the use and availability of such data.
It also needs to incorporate some deduplication approach, as I notice the same data is often repackaged with variations in format or specification.
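To make the deduplication idea a bit more concrete, here is a rough sketch of what I have in mind, not how any existing project does it. It only catches exact duplicates after light normalization (Unicode form, case, whitespace); genuinely reworded or reformatted copies would need near-duplicate detection, which this does not attempt. The function names and the sample records are made up for illustration.

```python
import hashlib
import unicodedata


def normalize(text: str) -> str:
    """Reduce cosmetic differences: Unicode form, case, and whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return " ".join(text.lower().split())


def content_key(text: str) -> str:
    """SHA-256 of the normalized text, used as a content fingerprint."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def dedupe(records):
    """Keep the first (source, text) pair seen for each fingerprint."""
    seen = set()
    unique = []
    for source, text in records:
        key = content_key(text)
        if key not in seen:
            seen.add(key)
            unique.append((source, text))
    return unique


# Hypothetical records: the same document repackaged by two datasets.
records = [
    ("dataset_a", "The quick brown fox.\n"),
    ("dataset_b", "the  quick brown   fox."),
    ("dataset_c", "A different document."),
]

for source, text in dedupe(records):
    print(source, repr(text))
```

A shared index of these fingerprints would let each host check whether it already holds a document before mirroring another repackaged copy of it.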