
but the size of the problem in total bytes may be a bigger issue than the actual lookup and indexing of the data.

One cool property of such a large data set (~200TB) is that you are almost certain to see a lot of the same data repeated.

It'd be neat to try to reduce the overall data footprint by assigning some sort of signature to repeats. I.e. if you are storing the sequences (let's imagine for a sec that these are non-trivial sizes) ABC, ABCD, ABCDE and ABCDEF, you could replace ABC with, say, #1 and so on, and perhaps save a whole lot of space.
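
A minimal sketch of that signature idea in Python, assuming a hash of each fixed-size chunk plays the role of the "signature" and each unique chunk is stored only once; the chunk size and hash choice are arbitrary illustrations, nothing specific to the dataset being discussed:

    import hashlib

    # Toy sketch of "assign a signature to repeats": split the stream into
    # fixed-size chunks, use a SHA-256 digest as the chunk's signature, and
    # keep only one copy of each unique chunk. Chunk size is an arbitrary choice.
    CHUNK_SIZE = 4096

    def dedupe(data: bytes):
        """Return (signatures, store): the stream as an ordered list of
        signatures plus a map from signature to the unique chunk bytes."""
        store = {}        # signature -> chunk bytes (one copy per unique chunk)
        signatures = []   # ordered signatures that reconstruct the stream
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            sig = hashlib.sha256(chunk).digest()
            store.setdefault(sig, chunk)   # later duplicates cost only a signature
            signatures.append(sig)
        return signatures, store

    def rehydrate(signatures, store) -> bytes:
        """Reassemble the original stream from signatures and the chunk store."""
        return b"".join(store[sig] for sig in signatures)

    if __name__ == "__main__":
        payload = b"ABCDEF" * 100_000   # highly repetitive input
        sigs, store = dedupe(payload)
        assert rehydrate(sigs, store) == payload
        print(f"{len(sigs)} chunks, {len(store)} unique")

Real deduplicating stores usually pick chunk boundaries from the content itself (a rolling hash) rather than at fixed offsets, so repeats still line up after insertions shift the data.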



This is basically the ZIP compression algorithm (or more precisely, the dictionary-coding half of DEFLATE).
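
A quick demonstration of that point, using Python's zlib as a stand-in for ZIP's DEFLATE (LZ77 back-references plus Huffman coding); the repeated sequences from the parent comment collapse to almost nothing:

    import zlib

    # The repeated ABC/ABCD/... sequences from the parent comment, many times over.
    data = b"ABC ABCD ABCDE ABCDEF " * 10_000
    packed = zlib.compress(data, level=9)
    print(f"{len(data):,} bytes -> {len(packed):,} bytes")
    assert zlib.decompress(packed) == data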


But then you might end up not being able to see the forest of data for all the Huffman trees.


It's called "deduplication" in databaseland; it's a common strategy for tape backups.



