Are you trying to download files from websites over HTTP or HTTPS, or are you doing something different?
Are you trying to spider a million different websites or one source?
I'm guessing that you are trying to spider lots of websites to get material to train an LLM, is that right?
Have you done the basic math? 1) Can your filesystem hold that many files? On a typical system the filesystem will use a minimum of 4KB for each file, so some space is wasted whenever a file does not fill its last cluster.
2) Do you have enough disk space?
3) What is the total size of the files, and how fast is your internet connection? If the files average 1MB and you could fully saturate a 1Gbps connection, this would take a minimum of three weeks.
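To make the three checks above concrete, here is a rough back-of-the-envelope sketch. The file count of 200 million is purely an assumed placeholder so the numbers have something to work on; substitute your own count. The 4KB block size, 1MB average and 1Gbps link come from the points above, and the function names are made up for illustration.

```python
import math

def transfer_days(num_files, avg_file_bytes, link_bits_per_sec):
    """Lower bound on transfer time in days, assuming the link stays fully saturated."""
    total_bits = num_files * avg_file_bytes * 8
    return total_bits / link_bits_per_sec / 86_400

def disk_bytes(num_files, avg_file_bytes, block_bytes=4096):
    """Space used on disk when every file is rounded up to whole blocks."""
    blocks_per_file = math.ceil(avg_file_bytes / block_bytes)
    return num_files * blocks_per_file * block_bytes

# ASSUMPTION: 200 million files is a hypothetical count, not a known figure.
NUM_FILES = 200_000_000
AVG_FILE_BYTES = 1_000_000      # 1MB average
LINK_BPS = 1_000_000_000        # 1Gbps

print(f"{transfer_days(NUM_FILES, AVG_FILE_BYTES, LINK_BPS):.1f} days at full line rate")
print(f"{disk_bytes(NUM_FILES, AVG_FILE_BYTES) / 1e12:.1f} TB on disk")
```

This ignores protocol overhead, slow servers, retries and inode limits, so real numbers will be worse than what it prints.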
A while ago I wrote a shell script to open like 100 instances on AWS, running a script on each that would be in charge of downloading a part of the files.
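I don't have that script anymore, but the splitting part can be sketched roughly like this. This is not the original code, just one simple way to do it, assuming each instance is told its own index and the total number of instances; `shard` and the example URLs are made-up names.

```python
def shard(items, num_workers, worker_index):
    """Return the part of the list that worker `worker_index` is responsible for."""
    if not 0 <= worker_index < num_workers:
        raise ValueError("worker_index must be in [0, num_workers)")
    # Round-robin split: worker 0 gets items 0, N, 2N, ...; worker 1 gets 1, N+1, ...
    return items[worker_index::num_workers]

# Hypothetical URL list; in practice this would be read from a file.
urls = [f"https://example.com/file{i}" for i in range(10)]
for w in range(3):
    print(w, shard(urls, 3, w))
```

Every instance gets the full list plus its index, so there is no coordination needed between machines and no URL is downloaded twice.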
Using curl or wget could probably be a bit faster than Python. You should also run multiple requests concurrently, because it's unlikely that any single site can fill your whole bandwidth.
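If you do stay in Python, a minimal sketch of the concurrent part using only the standard library could look like the following. The names `fetch` and `download_all`, the 32-thread default and the 30-second timeout are arbitrary choices for illustration, and naming files by the last path segment is a simplifying assumption that will collide on real-world URLs.

```python
import os
import urllib.request
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch(url, dest_dir, timeout=30):
    """Download one URL into dest_dir and return the local path."""
    # ASSUMPTION: the last path segment is a usable, unique file name.
    name = os.path.basename(url.rstrip("/")) or "index"
    path = os.path.join(dest_dir, name)
    with urllib.request.urlopen(url, timeout=timeout) as resp, open(path, "wb") as out:
        while chunk := resp.read(64 * 1024):  # stream in 64KB chunks
            out.write(chunk)
    return path

def download_all(urls, dest_dir, max_workers=32, fetch_one=fetch):
    """Fetch URLs concurrently; return (done, failed) dicts keyed by URL."""
    os.makedirs(dest_dir, exist_ok=True)
    done, failed = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch_one, u, dest_dir): u for u in urls}
        for fut in as_completed(futures):
            url = futures[fut]
            try:
                done[url] = fut.result()
            except Exception as exc:  # record the error and keep going
                failed[url] = exc
    return done, failed
```

You would call it as `done, failed = download_all(urls, "downloads")` and then retry whatever ended up in `failed`. Threads are fine here because the work is I/O-bound; tune `max_workers` until either your bandwidth or the remote sites become the limit.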
You could probably do the same with some serverless/cloud functions, but it could be a bit expensive.