
CommonCrawl also has a fairly large dataset of this sort ("The crawl currently covers 5 billion pages"), which, unlike the one from archive.org, is already available to everyone on S3 under the requester-pays model.

http://commoncrawl.org/data/accessing-the-data/



