
Would tarsnap be a solution for long-term archival of logfile data? I'm working on a data mining project of the "Let's store everything & figure out what we do with it later" type. My servers generate about 2GB of data (zipped) every day. We plan to store an 'analysis' dataset of the last 3 months on S3 and run a batch of Hadoop/Pig/MapReduce jobs every night on EC2.

My question: what would be the most cost-efficient long-term archival solution for Apache logs? I can live with slow access times. Does tarsnap offer any benefit here? Are there any compression solutions specific to Apache logs? Other ideas?



Two points.

First, this is not that much data (~180GB). Is there a particular reason not to just throw it on a hard disk on some machine that doesn't do too much during the night and write a trivial Perl script?
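For illustration, here is a minimal sketch of what that nightly job could look like, assuming logrotate-style rotated logs and xz for compression; all paths and file names here are made up, not taken from your setup:

    #!/usr/bin/perl
    # Nightly archival sketch: compress yesterday's rotated Apache log with xz
    # and park it on a big, cheap disk. Paths and naming scheme are assumptions.
    use strict;
    use warnings;
    use POSIX qw(strftime);

    my $logdir  = '/var/log/apache2';           # where rotated logs end up (assumed)
    my $archive = '/mnt/archive/apache-logs';   # the disk on the mostly-idle machine (assumed)

    # Assumes rotation produces names like access.log-20100115
    my $stamp = strftime('%Y%m%d', localtime(time - 86400));
    my $src   = "$logdir/access.log-$stamp";
    my $dst   = "$archive/access.log-$stamp.xz";

    die "no rotated log found at $src\n" unless -e $src;

    system("xz -9e -c '$src' > '$dst'") == 0
        or die "xz failed on $src: $?\n";

Run it from cron after log rotation and the archive is just a directory of .xz files; xzcat or xzgrep gets the data back out whenever you need it.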

Second, (g)zip may not be the best solution here. A quick, unscientific test on ~3MB of Apache log data (in the default Common Log Format): gzip or zip produces ~240KB, while xz (formerly lzma) gets it down to ~80KB (using -9e) or ~96KB (using the default options).
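If you want to repeat that kind of test on your own data, something along these lines will do; access.log.sample is a placeholder for a few MB of your real logs, and the exact numbers will of course depend on their contents:

    #!/usr/bin/perl
    # Compare gzip and xz output sizes on a sample of real log data.
    use strict;
    use warnings;

    my $sample = 'access.log.sample';   # placeholder: a chunk of your own logs
    die "put a sample of your logs in $sample first\n" unless -s $sample;
    printf "%-10s %10d bytes\n", 'original', -s $sample;

    my @tests = (
        ['gzip -9', "$sample.gz"],
        ['xz',      "$sample.xz"],
        ['xz -9e',  "$sample.9e.xz"],
    );

    for my $t (@tests) {
        my ($cmd, $out) = @$t;
        system("$cmd -c '$sample' > '$out'") == 0 or die "$cmd failed\n";
        printf "%-10s %10d bytes\n", $cmd, -s $out;
    }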

In the same quick, unscientific test, xz decompressed about half as fast as gzip and about ten times faster than bzip2. It is very likely fast enough to keep up with your disk.



