The trick is, of course, that it's nearly impossible to predict ahead of time what will be useful to someone. You can probably filter out some of the spam, but a comprehensive archiving project should be careful about false positives when throwing things away.
Seems like a hard problem to solve. The low-hanging fruit would probably be detecting duplicates and combining them, which loses redundancy but handles all of those identical landing pages.
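For what it's worth, here's a rough sketch of what exact-duplicate merging could look like, assuming pages arrive as (URL, raw bytes) pairs. The `dedupe` function and the example URLs are made up for illustration, not taken from any real archiving tool. It stores each distinct body once and keeps the list of URLs that served it, so the landing pages collapse without losing the record of where they appeared.

```python
import hashlib


def content_key(body: bytes) -> str:
    """Return a SHA-256 hex digest used as the identity of a page body."""
    return hashlib.sha256(body).hexdigest()


def dedupe(pages):
    """Group pages by identical content.

    `pages` is an iterable of (url, body) pairs (a hypothetical input shape).
    Returns a dict mapping digest -> {"body": bytes, "urls": [url, ...]}.
    """
    store = {}
    for url, body in pages:
        key = content_key(body)
        # Keep one copy of the body, but remember every URL that served it.
        entry = store.setdefault(key, {"body": body, "urls": []})
        entry["urls"].append(url)
    return store


# Made-up sample data: two identical landing pages and one unique page.
pages = [
    ("http://a.example/", b"<html>Parked domain</html>"),
    ("http://b.example/", b"<html>Parked domain</html>"),
    ("http://c.example/", b"<html>Something unique</html>"),
]
store = dedupe(pages)
for digest, entry in store.items():
    print(digest[:12], len(entry["urls"]), "url(s)")
```

This only catches byte-for-byte duplicates. Pages that differ by a timestamp or a domain name in the markup would hash differently, so near-duplicates would need some fuzzier comparison on top.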