Common Crawl also has a fairly large dataset of this sort ("The crawl currently covers 5 billion pages"), which, unlike the one from archive.org, is already available to everyone on S3 under the requester-pays model.

http://commoncrawl.org/data/accessing-the-data/



