One of the biggest challenges I've faced in scraping data is that most websites now blocklist almost all datacentre IPs, including the Amazon and Azure blocks. If you really need to get anything useful out of it, the only way is to use residential IP addresses, which are usually very expensive and often shady (think SDK-in-a-mobile-game-proxying-your-traffic shady).
It almost makes me feel that I am breaking the law when scraping a site, yet web scraping is one of the most basic programming things.
Just imagine where Google would be if it was a new startup and an existing giant like Cloudflare or Cisco blocked all attempts of access.
> It almost makes me feel that I am breaking the law when scraping a site, yet web scraping is one of the most basic programming things.
Yeah, same for me.
Regarding the denylisting, I guess it depends on what is being scraped and how often the scraping happens?
I'm maintaining a remote jobs aggregator website and I've never been blocked before (but I don't scrape the same web page more than ~5 times per day).
And with a caching strategy, I think even a scrape-as-a-service API like the one I'm building in the article should be "kinda" safe (besides edge cases that constantly bust the cache, e.g. by adding random query params)?
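A rough sketch of what I mean by caching (a minimal Python example; the function names and the one-hour TTL are just placeholders, and a real service would also normalise URLs or strip junk query params before using them as cache keys):

```python
import time
import requests

CACHE_TTL = 60 * 60  # hypothetical TTL: serve a cached copy for up to an hour
_cache: dict[str, tuple[float, str]] = {}  # url -> (fetched_at, html)

def fetch_cached(url: str) -> str:
    """Return the page body, hitting the upstream site at most once per TTL."""
    now = time.time()
    cached = _cache.get(url)
    if cached and now - cached[0] < CACHE_TTL:
        return cached[1]  # still fresh, no request to the target site
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    _cache[url] = (now, resp.text)
    return resp.text
```

That way, even if many users ask for the same page, the target site only sees one request per TTL window.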
A lot of websites are behind Cloudflare, which does make scraping quite difficult (just by default, I think).
Spoofing your user agent is a must if you need to do anything nowadays.
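For example (a minimal sketch with Python's requests; the UA string is just an illustrative desktop Chrome one):

```python
import requests

# Pretend to be a regular desktop browser instead of the default
# "python-requests/x.y.z" user agent, which many sites block outright.
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/120.0.0.0 Safari/537.36"
    )
}

resp = requests.get("https://example.com/jobs", headers=headers, timeout=10)
print(resp.status_code)
```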
To your second point: the same would apply to Google, Bing, or any other search engine. Even if you follow robots.txt and consume equal or lesser bandwidth, it doesn't matter much if you aren't an established player.
Honestly, I'm only starting to accept how stupid it is that we call datacenter services "cloud" now; I just can't bear the stupidity of calling running a script on a server a "serverless function".
Having felt the pain of maintaining my own instances (deploying, updating, dockering) when all I wanted was to have a function deployed out there somewhere, I was stoked when serverless functions were announced at AWS re:Invent. I never felt it was a stupid name, because I was done with the servers and all their annoyances. FaaS (Functions as a Service) could be called a dumb name by the same logic: we've been executing functions the whole time!