but the total size of the data in bytes may be a bigger problem than actually looking up and indexing it.
One cool property of such a large dataset (~200TB) is that you are almost certain to see a lot of the same data repeated.
It'd be neat to try to reduce the overall data footprint by assigning some sort of signature to repeats. E.g. if you are storing the sequences (let's imagine for a sec that these are non-trivial sizes) ABC, ABCD, ABCDE and ABCDEF, you could replace ABC with, say, #1 and perhaps save a whole lot of space.
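A minimal sketch of that idea, assuming fixed-size chunking with SHA-256 digests standing in for the "signatures" (the chunk size, helper names, and toy sequences here are mine, just for illustration):

```python
import hashlib

CHUNK_SIZE = 4  # tiny for the toy example; real systems chunk in KBs or MBs

def store(data: bytes, chunk_store: dict) -> list:
    """Split data into fixed-size chunks, store each unique chunk once
    keyed by its SHA-256 digest, and return the list of digests."""
    refs = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).digest()
        chunk_store.setdefault(digest, chunk)  # only first occurrence is kept
        refs.append(digest)
    return refs

def restore(refs: list, chunk_store: dict) -> bytes:
    """Reassemble the original bytes from a list of digests."""
    return b"".join(chunk_store[d] for d in refs)

chunks = {}
sequences = [b"ABC1", b"ABC1ABC2", b"ABC1ABC2ABC3"]
manifests = [store(s, chunks) for s in sequences]

# The shared prefix chunks are stored exactly once:
print(len(chunks), "unique chunks for", sum(len(s) for s in sequences), "input bytes")
assert all(restore(m, chunks) == s for m, s in zip(manifests, sequences))
```

One caveat: fixed-size chunking only catches repeats that happen to land on chunk boundaries, so a single inserted byte shifts every boundary after it. Deduplicating backup tools generally use content-defined chunking (Rabin fingerprinting and the like) so boundaries track the content instead.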