
What if you're scraping with AJAX? Wouldn't each individual user's IP take the hit rather than the scraper's own server IP?



You can't scrape other sites directly with AJAX because of cross-domain security restrictions (the browser's same-origin policy).

One potential way to obey robots.txt might be to spawn multiple small EC2 instances with different IPs and have them coordinate with each other to share the crawling without individually running over the limits. (This is also useful for scraping sites that have rate limits.)
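A minimal sketch of that coordination idea, assuming a shared Redis work queue (the "coordinator" hostname and the key names are hypothetical): each instance pops URLs from a shared frontier and enforces its own per-IP delay, so no single IP runs over the site's limit.

    import time

    import redis
    import requests

    PER_IP_DELAY = 10  # assumed per-IP politeness delay, in seconds

    r = redis.Redis(host="coordinator", port=6379)  # shared coordination point

    while True:
        url = r.lpop("crawl:frontier")        # shared work queue across instances
        if url is None:
            break                             # frontier exhausted
        resp = requests.get(url.decode(), timeout=30)
        r.hset("crawl:pages", url, resp.text)  # store the page where any instance can read it
        time.sleep(PER_IP_DELAY)              # this instance stays under its own rate limit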


robots.txt doesn't enforce itself, and its limits aren't per-IP; splitting the crawl across instances is still a violation, no better than simply lowering the delay on a single scraper.
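Honoring robots.txt is entirely the crawler's job; a minimal sketch of a voluntary check using Python's stdlib parser (example.com and the "my-crawler" user agent string are placeholders):

    from urllib import robotparser

    rp = robotparser.RobotFileParser("https://example.com/robots.txt")
    rp.read()  # fetch and parse the file; nothing here is enforced by the server

    if rp.can_fetch("my-crawler", "https://example.com/some/page"):
        delay = rp.crawl_delay("my-crawler") or 1  # fall back to a polite default
        print(f"allowed; wait {delay}s between requests")
    else:
        print("disallowed by robots.txt")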


The AJAX request looks to be getting proxied through this guy's server. You have to do something like that because of cross-domain AJAX restrictions.
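A minimal sketch of that proxy pattern, assuming a Flask endpoint (the /fetch route and "url" query parameter are hypothetical): the browser's AJAX call hits your own domain, and the server fetches the remote page on its behalf, sidestepping the same-origin restriction.

    import requests
    from flask import Flask, request

    app = Flask(__name__)

    @app.route("/fetch")
    def fetch():
        target = request.args.get("url")
        if not target:
            return "missing url parameter", 400
        resp = requests.get(target, timeout=30)  # server-side fetch, not subject to same-origin policy
        return resp.text, resp.status_code       # relay the remote body back to the browser

    if __name__ == "__main__":
        app.run()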



