Doing S3 PUT requests for 260M files every week would cost around $1,300 USD/week, which was too much for our budget.
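(At S3's standard PUT price of roughly $0.005 per 1,000 requests, that works out to 260,000,000 / 1,000 × $0.005 ≈ $1,300 per week.)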
> or in ZIP format?
We looked at zips, but due to the way the header (well, the central file directory) is laid out, finding a specific file inside the zip would require the system to download most of the CFD.
The zip CFD is basically a list of header entries that vary in size (30 bytes + file_name length), so to find a specific file you have to iterate the CFD until you hit the entry you want.
Assuming you have a smallish archive (~1 million files), the CFD for the zip would be somewhere in the order of 50MB+ (depending on filename length).
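To make that concrete, here is a minimal Go sketch of the lookup the central directory forces on you, using the standard archive/zip package on a local file (the archive path and file name are made up); over HTTP the same walk would first require downloading most of the CFD:

    package main

    import (
        "archive/zip"
        "fmt"
        "log"
    )

    func main() {
        // OpenReader parses the entire central directory up front.
        zr, err := zip.OpenReader("archive.zip") // illustrative path
        if err != nil {
            log.Fatal(err)
        }
        defer zr.Close()

        want := "photos/0001.jpg" // illustrative file name
        for _, f := range zr.File { // O(n): walk every directory entry
            if f.Name == want {
                fmt.Printf("found %s: %d bytes compressed\n", f.Name, f.CompressedSize64)
                return
            }
        }
        log.Fatalf("%s not found in archive", want)
    }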
Using a hash index, you know exactly where in the index you need to look, so you can use a range request to load just that header entry:
offset = hash(file_name) % slot_count
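For illustration, a hedged Go sketch of that lookup (the slot size, slot count, index offset, URL and FNV hash are all assumptions, not any particular on-disk format): compute the slot from the hash, turn it into a byte range, and fetch only those bytes:

    package main

    import (
        "fmt"
        "hash/fnv"
        "io"
        "log"
        "net/http"
    )

    const (
        slotSize    = 64      // assumed fixed size of one index slot, in bytes
        slotCount   = 1 << 20 // assumed number of slots in the hash index
        indexOffset = 0       // assumed byte offset of the index inside the object
    )

    // slotOffset mirrors: offset = hash(file_name) % slot_count,
    // then converts the slot number into a byte position.
    func slotOffset(name string) int64 {
        h := fnv.New64a()
        h.Write([]byte(name))
        slot := h.Sum64() % slotCount
        return indexOffset + int64(slot)*slotSize
    }

    // fetchSlot issues an HTTP range request for just the one slot
    // that can contain the header entry for this file name.
    func fetchSlot(url, name string) ([]byte, error) {
        off := slotOffset(name)
        req, err := http.NewRequest(http.MethodGet, url, nil)
        if err != nil {
            return nil, err
        }
        req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", off, off+slotSize-1))

        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            return nil, err
        }
        defer resp.Body.Close()
        return io.ReadAll(resp.Body)
    }

    func main() {
        slot, err := fetchSlot("https://example.com/archive.idx", "photos/0001.jpg")
        if err != nil {
            log.Fatal(err)
        }
        fmt.Printf("read %d index bytes for one lookup\n", len(slot))
    }

The point is that one lookup touches a constant number of bytes instead of the whole directory.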
Another file format that has been gaining popularity recently is PMTiles[1], which uses a tree index; however, it is specifically for tiled geospatial data.
When it is server-side, reading a 50MB CFD is a small task. And once it is read, we can store the zipindex for even faster access.
We made 'zipindex' to purposely be a sparse, compact, but still reasonably fast representation of the CFD - just enough to be able to serve the file. Typically it is around an 8:1 reduction on the CFD, but it of course depends a lot on your file names, as you say (the index is zstandard compressed).
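As a rough illustration of the idea (not zipindex's actual format or API, just a sketch with made-up fields and values): keep only what is needed to serve each file, serialize it, and compress with zstandard:

    package main

    import (
        "encoding/json"
        "fmt"
        "log"

        "github.com/klauspost/compress/zstd"
    )

    // entry holds just enough to serve one file: where its data starts,
    // how big it is, and its CRC. The real zipindex layout differs.
    type entry struct {
        Name             string `json:"n"`
        Offset           int64  `json:"o"`
        CompressedSize   int64  `json:"c"`
        UncompressedSize int64  `json:"u"`
        CRC32            uint32 `json:"crc"`
    }

    func main() {
        // Made-up entries standing in for what would be extracted from the CFD.
        index := []entry{
            {Name: "photos/0001.jpg", Offset: 0, CompressedSize: 182000, UncompressedSize: 182000, CRC32: 0x1a2b3c4d},
            {Name: "photos/0002.jpg", Offset: 182046, CompressedSize: 95000, UncompressedSize: 95000, CRC32: 0x5e6f7a8b},
        }

        raw, err := json.Marshal(index)
        if err != nil {
            log.Fatal(err)
        }

        // Compress the serialized index with zstandard.
        enc, err := zstd.NewWriter(nil)
        if err != nil {
            log.Fatal(err)
        }
        defer enc.Close()
        compressed := enc.EncodeAll(raw, nil)

        fmt.Printf("serialized: %d bytes, compressed: %d bytes\n", len(raw), len(compressed))
    }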
Access time from fully compressed data to a random file entry is around 100ms with 1M files. Obviously, if you keep the index in memory, it is much less. This time is pretty much linear, which is why we recommend aiming for 10K files per archive, which makes the impact pretty minimal.