Creating a serverless function to scrape web pages metadata (mmazzarolo.com)
31 points by mmazzarolo on June 6, 2021 | 11 comments



One of the biggest challenges I've faced in scraping data has always been that most websites now blacklist almost all datacentre IP blocks, including Amazon's, Azure's, etc. If you really need to get anything useful out of it, the only way is to use residential IP addresses, which are usually super expensive and often shady (think "SDK in a mobile game proxying your traffic" shady).

It almost makes me feel that I am breaking the law when scraping a site, yet web scraping is one of the most basic programming things.
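
It really is basic. A minimal sketch of a title scrape, assuming Node 18+ with the global fetch (nothing from the article, just an illustration):

    // Fetch a page and naively pull out its <title> with a regex.
    async function getTitle(url: string): Promise<string | null> {
      const res = await fetch(url);
      const html = await res.text();
      const match = html.match(/<title[^>]*>([^<]*)<\/title>/i);
      return match ? match[1].trim() : null;
    }

    getTitle("https://example.com").then(console.log);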

Just imagine where Google would be if it was a new startup and an existing giant like Cloudflare or Cisco blocked all attempts of access.


> It almost makes me feel that I am breaking the law when scraping a site, yet web scraping is one of the most basic programming things.

Yeah, same for me.

Regarding the denylisting, I guess it depends on what is being scraped and how often the scraping happens? I'm maintaining a remote jobs aggregator website and I've never been blocked before (but I'm not scraping the same web page more than ~5 times per day). And with a caching strategy, I think even a scrape-as-a-service API like the one I'm building in the article should be "kinda" safe (besides edge cases that brute-force the cache constantly, e.g. by adding random query params)?
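
For that random query-param edge case, I imagine normalizing the cache key would help. A hypothetical sketch (the param allowlist is made up for illustration):

    // Derive the cache key from a normalized URL so that junk query
    // params can't bust the cache; only allowlisted params survive.
    const ALLOWED_PARAMS = ["id", "page"]; // illustrative allowlist

    function cacheKey(rawUrl: string): string {
      const url = new URL(rawUrl);
      const kept = new URLSearchParams();
      for (const name of ALLOWED_PARAMS) {
        const value = url.searchParams.get(name);
        if (value !== null) kept.set(name, value);
      }
      url.search = kept.toString();
      url.hash = "";
      return url.toString();
    }

    // cacheKey("https://example.com/job?id=1&utm_junk=abc")
    // -> "https://example.com/job?id=1"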


> most websites are now blacklisting almost all datacentre IPs including Amazon, Azure, etc

“Most” sounds like an exaggeration. Wouldn’t this also create problems for virtual desktop services like Amazon Workspaces?

> It almost makes me feel that I am breaking the law when scraping a site

You might be violating their copyright; it depends on what you do with it. If you overdo it, you could also degrade their service for actual users.


A lot of websites are using Cloudflare, which does make scraping quite difficult (just by default, I think).

Spoofing your user agent is a must if you need to do anything nowadays.
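
Something along these lines with Node's built-in fetch; the UA string is just an example browser value, not anything a particular site requires:

    // Send a browser-like User-Agent instead of the default one,
    // which many sites reject outright.
    async function fetchAsBrowser(url: string): Promise<string> {
      const res = await fetch(url, {
        headers: {
          "User-Agent":
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 " +
            "(KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36",
        },
      });
      return res.text();
    }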

To your second point, the same would apply to Google, Bing, and any other search engine. Even if you follow robots.txt and consume equal or lesser bandwidth, it doesn't matter much if you aren't an established player.


> You might be violating their copyright; it depends on what you do with it.

Google is no stranger to such complaints (from news sites, for instance). Everything is very relative.


Honestly, I'm only now starting to accept how stupid it is that we call datacenter services "cloud"; I just can't bear the stupidity of calling a script running on a server a "serverless function".


Having felt the pain of maintaining my own instances (deploying, updating, dockering) when all I wanted was to have a function deployed out there somewhere, I was stoked when serverless functions were announced at AWS re:Invent. I never felt it was a stupid name, because I was done with servers and all their annoyances. By the same logic, FaaS (Functions as a Service) could be called a dumb name too: we've been executing functions the whole time!
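
That was the whole appeal for me: the deployable unit shrinks to a single exported function. A rough sketch of the AWS Lambda Node handler shape (the event fields here are illustrative, not a real API contract):

    // An entire "deployment" is just this exported handler.
    export const handler = async (event: { url?: string }) => ({
      statusCode: 200,
      body: JSON.stringify({ received: event.url ?? null }),
    });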


Considering the alternative would be running your own server, what would you suggest “serverless” be called instead?


People have been running on “someone else’s server” for decades. So, no, that’s not the alternative.


All code must run on a processor somewhere, and if it's not your server it is someone else's server. I also have issues with the term serverless.

However, the term is generally accepted for cloud services where you run code on someone else's server. It is a product you can use.

I would be greatly disappointed if there are developers who think serverless code runs on air.


I think it makes sense when it's simply describing that you don't manage the server.



