Oh, I see they recommend writing WARC files with wget using `--no-warc-digests`. You really should not do that. One, it's just a SHA-1, which is costly in neither CPU nor storage. Two, the digest is used to create revisit records for de-duplication. If you disable it, you or someone else might end up with lots of duplicate resources when re-crawling.
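To make the de-duplication point concrete, here is a rough sketch of how a payload digest lets a crawler decide between storing a full response and a lightweight revisit record. It is not wget's actual implementation: `payload_digest` and `plan_records` are hypothetical helper names, and the `sha1:<base32>` label format is assumed from how such digests commonly appear in WARC headers.

```python
import base64
import hashlib


def payload_digest(payload: bytes) -> str:
    """Return a 'sha1:<base32>' label for the payload (format is an assumption)."""
    raw = hashlib.sha1(payload).digest()
    return "sha1:" + base64.b32encode(raw).decode("ascii")


def plan_records(fetches):
    """For each (url, payload) pair, choose a full 'response' or a 'revisit'.

    Hypothetical helper: a real crawler would write WARC records here
    instead of returning a plan.
    """
    seen = {}  # digest -> URL of the first capture carrying that payload
    plan = []
    for url, payload in fetches:
        digest = payload_digest(payload)
        if digest in seen:
            # Same bytes already stored: point back at the earlier capture.
            plan.append((url, "revisit", seen[digest]))
        else:
            seen[digest] = url
            plan.append((url, "response", None))
    return plan


if __name__ == "__main__":
    fetches = [
        ("http://example.com/a.css", b"body { margin: 0 }"),
        ("http://example.com/b.css", b"body { margin: 0 }"),
        ("http://example.com/c.css", b"body { margin: 1px }"),
    ]
    for entry in plan_records(fetches):
        print(entry)
```

Without a digest there is nothing to key `seen` on, so every fetch would have to be stored in full.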


