

Ask HN: Techniques for compressing web-crawled data? - int3

I'm currently writing a scraper, which I intend to run periodically to look for certain trends / changes. I realize that most of the time this data will be minimally changed from one scrape to the next. It seems like there is a good opportunity for compression here.

I'm thinking of using git, with its packfile format, for the task. However, it seems a little clumsy, and I'd like to find out if you guys have come up with other solutions.
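A minimal sketch of the git-packfile idea, assuming git is installed; the
`scrapes/` directory, the `store` helper, and the file names are hypothetical,
one file per page, overwritten and committed on each scrape:

    import pathlib
    import subprocess

    repo = pathlib.Path("scrapes")  # hypothetical repo: one file per scraped page
    repo.mkdir(exist_ok=True)
    subprocess.run(["git", "init", "-q"], cwd=repo, check=True)

    def store(name: str, body: bytes) -> None:
        """Overwrite the page's file and commit it; repacking later
        delta-compresses the near-identical versions against each other."""
        (repo / name).write_bytes(body)
        subprocess.run(["git", "add", name], cwd=repo, check=True)
        subprocess.run(["git", "-c", "user.name=scraper",
                        "-c", "user.email=scraper@example.com",
                        "commit", "-q", "--allow-empty", "-m", f"scrape {name}"],
                       cwd=repo, check=True)

    store("example.html", b"<html>first scrape</html>")
    store("example.html", b"<html>first scrape, with a small change</html>")
    # Repack loose objects into a delta-compressed packfile.
    subprocess.run(["git", "gc", "-q"], cwd=repo, check=True)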
======
wmf
<http://feedblog.org/2008/10/12/google-bigtable-compression-zippy-and-bmdiff/>

There's also an interesting discussion at: <http://rusty.ozlabs.org/?p=81>
You can get better compression by organizing your data so that similar data
falls within the compression window. If you can't reorganize the data, you
can build a custom compressor that specifically looks at older versions of a
page for matches (which is roughly what git's delta compression does).
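One way to realize that "look at older versions for matches" idea is zlib's
preset-dictionary support: seed the compressor with the previous scrape of the
same page so unchanged runs compress to short back-references. A minimal
sketch; the file names `page_prev.html` / `page_curr.html` are placeholders:

    import zlib

    # Hypothetical file names: the stored previous scrape and the newly
    # fetched page for the same URL.
    prev = open("page_prev.html", "rb").read()
    curr = open("page_curr.html", "rb").read()

    # zlib's LZ77 window is 32 KB, so only the last 32 KB of the
    # dictionary can be matched against.
    zdict = prev[-32768:]

    comp = zlib.compressobj(level=9, zdict=zdict)
    packed = comp.compress(curr) + comp.flush()

    # Decompression must be seeded with the exact same dictionary.
    decomp = zlib.decompressobj(zdict=zdict)
    assert decomp.decompress(packed) + decomp.flush() == curr

Note that you'd have to store `packed` together with a pointer to the version
used as the dictionary; without that exact previous version the blob can't be
decompressed.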

