
but the size of the problem in total bytes may be a bigger issue than the actual lookup and indexing of the data.

One cool property of such a large data set (~200TB) is that you are almost certain to see a lot of the same data repeated.

It'd be neat to try to reduce the overall data footprint by assigning some sort of signature to repeats. I.e. if you are storing the sequences (let's imagine for a sec that these are non-trivial sizes) ABC, ABCD, ABCDE and ABCDEF, you could replace ABC with, say, #1 and so on, and perhaps save a whole lot of space.
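
A minimal sketch of that signature idea in Python, assuming a hash of each fixed-size chunk plays the role of the "signature" and each unique chunk is stored only once; the chunk size and hash choice are arbitrary illustrations, nothing specific to the dataset being discussed:

    import hashlib

    # Toy sketch of "assign a signature to repeats": split the stream into
    # fixed-size chunks, use a SHA-256 digest as the chunk's signature, and
    # keep only one copy of each unique chunk. Chunk size is an arbitrary choice.
    CHUNK_SIZE = 4096

    def dedupe(data: bytes):
        """Return (signatures, store): the stream as an ordered list of
        signatures plus a map from signature to the unique chunk bytes."""
        store = {}        # signature -> chunk bytes (one copy per unique chunk)
        signatures = []   # ordered signatures that reconstruct the stream
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            sig = hashlib.sha256(chunk).digest()
            store.setdefault(sig, chunk)   # later duplicates cost only a signature
            signatures.append(sig)
        return signatures, store

    def rehydrate(signatures, store) -> bytes:
        """Reassemble the original stream from signatures and the chunk store."""
        return b"".join(store[sig] for sig in signatures)

    if __name__ == "__main__":
        payload = b"ABCDEF" * 100_000   # highly repetitive input
        sigs, store = dedupe(payload)
        assert rehydrate(sigs, store) == payload
        print(f"{len(sigs)} chunks, {len(store)} unique")

Real deduplicating stores usually pick chunk boundaries from the content itself (a rolling hash) rather than at fixed offsets, so repeats still line up after insertions shift the data.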



This is basically the ZIP compression algorithm (or more precisely, the dictionary-coding half of DEFLATE).
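
A quick demonstration of that point, using Python's zlib as a stand-in for ZIP's DEFLATE (LZ77 back-references plus Huffman coding); the repeated sequences from the parent comment collapse to almost nothing:

    import zlib

    # The repeated ABC/ABCD/... sequences from the parent comment, many times over.
    data = b"ABC ABCD ABCDE ABCDEF " * 10_000
    packed = zlib.compress(data, level=9)
    print(f"{len(data):,} bytes -> {len(packed):,} bytes")
    assert zlib.decompress(packed) == data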


But then you might end up not being able to see the forest of data for all the Huffman trees.


It's called "deduplication" in databaseland; it's a common strategy for tape backups.



