Ask HN: Downloading 200M Files Fast

joey_spaztard · 2024-05-17T07:43:55

It depends on details that you have not told us.

Are you trying to download files from websites over http or https or are you doing something different?

Are you trying to spider a million different websites or one source?

I'm guessing that you are trying to spider lots of websites to get material to train an LLM, is that right?

Have you done the basic math?

1) Can your filesystem hold that many files? On a typical system the filesystem will use a minimum of 4KB for each file, so some space is wasted due to cluster size.

2) Do you have enough disk space?

3) What is the total size of the files, and how fast is your internet connection? If the files average 1MB (200TB in total) and you could fully use a 1Gbps connection, this would take a minimum of about 18.5 days, so close to three weeks.
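
For anyone who wants to plug in their own numbers, here is a rough sketch of the arithmetic above. The function names are made up for illustration, and the 4KB block size and 1MB average file size are just the assumptions from this comment, not measurements.

```python
def transfer_days(num_files, avg_file_bytes, link_bits_per_sec):
    """Best-case transfer time in days, assuming the link is fully saturated."""
    total_bits = num_files * avg_file_bytes * 8
    seconds = total_bits / link_bits_per_sec
    return seconds / 86_400  # seconds per day


def min_disk_bytes(num_files, avg_file_bytes, block_bytes=4096):
    """Disk usage when every file is rounded up to whole blocks (assumed 4KB)."""
    blocks_per_file = -(-avg_file_bytes // block_bytes)  # ceiling division
    return num_files * blocks_per_file * block_bytes


if __name__ == "__main__":
    # 200M files, 1MB average, 1Gbps link
    print(round(transfer_days(200_000_000, 1_000_000, 1_000_000_000), 1))  # 18.5
    print(min_disk_bytes(200_000_000, 1_000_000))
```

This ignores protocol overhead, per-request latency, and filesystem metadata, so real numbers will be worse.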

FezzikTheGiant · 2024-05-17T09:41:40

Trying to download audio data via HTTP - I have a GCS bucket

leros · 2024-05-17T13:18:52

Have you tried downloading the bucket itself? Something like: https://stackoverflow.com/a/66816681

Might be faster than individual http downloads

FrenchDevRemote · 2024-05-17T12:28:19

A while ago I wrote a shell script to open like 100 instances on AWS, running a script on each that would be in charge of downloading a part of the files.

Using curl or wget could probably be a bit faster than Python. You should also run multiple requests concurrently, because it's unlikely that the sites can fill your whole bandwidth.

You could probably do the same with some serverless/cloud functions but it could be a bit expensive
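
The "each instance downloads a part of the files" idea can be sketched roughly like this. It is not the original shell script, just one simple way to split a list, and `shard`, `num_workers` and `worker_index` are hypothetical names. It assumes every instance has the same list in the same order.

```python
def shard(items, num_workers, worker_index):
    """Return the slice of `items` that one worker is responsible for.

    Worker k takes items k, k + num_workers, k + 2 * num_workers, ...
    so every item goes to exactly one worker.
    """
    if not 0 <= worker_index < num_workers:
        raise ValueError("worker_index must be in [0, num_workers)")
    return items[worker_index::num_workers]


if __name__ == "__main__":
    files = [f"file_{i}.wav" for i in range(10)]  # placeholder names
    for k in range(3):
        print(k, shard(files, 3, k))
```

Each instance would get its own `worker_index` (for example via an environment variable) and only download its slice.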

FezzikTheGiant · 2024-05-21T05:30:03

this is what i ended up doing

jventura · 2024-05-17T15:14:48

Look for ThreadPools or ThreadPoolExecutors in Python. What you need is covered in the ThreadPoolExecutor documentation [1].

[1] https://docs.python.org/3.12/library/concurrent.futures.html...
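
A minimal sketch of that approach, using only the standard library. `fetch` and `download_all` are names invented here, and the worker count and timeout are arbitrary assumptions you would need to tune.

```python
import concurrent.futures
import urllib.request


def fetch(url, timeout=60):
    """Download one URL and return its body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def download_all(urls, fetch_fn=fetch, max_workers=32):
    """Run fetch_fn over urls concurrently; return (results, errors) dicts."""
    results, errors = {}, {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        future_to_url = {pool.submit(fetch_fn, url): url for url in urls}
        for future in concurrent.futures.as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # keep going if one download fails
                errors[url] = exc
    return results, errors
```

This keeps every result in memory and submits all URLs at once, so for 200M files you would call it on batches of URLs and write each body to disk as it completes instead of collecting it.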

solardev · 2024-05-17T23:01:34

Can you tar them up and download them as one big file?
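
If you control a machine that sits next to the files, a rough sketch of the packing step with Python's tarfile module could look like this. `pack_directory` is a made-up helper, and it assumes the files are already on a local disk on that machine. The archive is left uncompressed, on the assumption that audio is usually compressed already.

```python
import tarfile
from pathlib import Path


def pack_directory(src_dir, tar_path):
    """Write every regular file under src_dir into one uncompressed tar.

    Returns the number of files added. tar_path should live outside src_dir.
    """
    src = Path(src_dir)
    count = 0
    with tarfile.open(tar_path, "w") as tar:
        for path in sorted(src.rglob("*")):
            if path.is_file():
                # Store paths relative to src_dir so the archive is relocatable.
                tar.add(str(path), arcname=path.relative_to(src).as_posix())
                count += 1
    return count
```

You would probably make many archives of a few GB each rather than one giant file, so a failed transfer only costs one batch.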