
Looks interesting, and thank you for sharing this! One common issue with scraping web pages is dealing with data that is dynamically loaded. Is there a solution for this? For example, when using Scrapy, you can have Splash running in Docker via scrapy-splash (https://github.com/scrapy-plugins/scrapy-splash).



Thanks! As mentioned in another comment, there is no built-in support for this yet.

As a workaround, one could use a service like ScrapingBee (not affiliated) as a proxy that renders the page in a browser for you.
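To illustrate the general idea of routing requests through a rendering proxy, here is a rough sketch in plain Python (standard library only), not flyscrape's own configuration. The proxy URL is a made-up placeholder, and how a given service expects credentials or options to be passed is an assumption you would need to check against its documentation.

```python
import urllib.request


def build_proxy_opener(proxy_url):
    """Return an opener that sends HTTP and HTTPS traffic through proxy_url.

    proxy_url is hypothetical here; a rendering service would supply its own
    host, port, and credentials.
    """
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


# Usage (placeholder values, not a real endpoint):
# opener = build_proxy_opener("http://proxy.example:8080")
# html = opener.open("https://example.com/page", timeout=30).read()
```

The scraper itself stays unchanged; only the transport differs, which is why a proxy can be swapped in or out later.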

Of course, relying on a service for this is not always ideal. I am also working on a small wrapper that turns Chrome into an HTTPS proxy, which you could plug right into flyscrape. Unfortunately, it is still very experimental and not public yet. I have not yet decided whether to release it as part of flyscrape or as a separate project.


Can't you load the URL that is being dynamically loaded directly within your scraper?


Not only can you, but in my experience it is substantially less drama and arguably puts less load on the target system, since a full page load may make many, many other requests that a presentation layer cares about but that I don't.
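For anyone who wants to try this, a minimal sketch in Python of fetching such an endpoint directly, assuming it returns JSON. The endpoint URL, the Referer value, and the User-Agent string are all hypothetical; in practice you would copy the real request from the browser's network tab and keep only the headers the endpoint actually requires.

```python
import json
import urllib.request


def build_request(endpoint_url, referer, user_agent="example-scraper/0.1"):
    """Build a GET request for the data endpoint the page would normally call.

    The header values are illustrative assumptions, not requirements of any
    particular site.
    """
    headers = {
        "Accept": "application/json",
        "Referer": referer,
        "User-Agent": user_agent,
    }
    return urllib.request.Request(endpoint_url, headers=headers)


def fetch_json(request, opener=urllib.request.urlopen, timeout=10):
    """Send the request and decode the JSON body.

    opener is injectable so the function can be exercised without a network.
    """
    with opener(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (placeholder URLs):
# req = build_request("https://example.com/api/items?page=1",
#                     referer="https://example.com/items")
# data = fetch_json(req)
```

This is one request instead of the dozens a rendered page would trigger.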

The trade-offs usually fall into:

- authing to the endpoint can sometimes be weird

- it for sure makes the traffic stand out since it isn't otherwise surrounded by those extraneous requests

- it, as with all good things scraping, carries its own maintenance and monitoring burden

However, similar to those trade-offs, it's also been my experience that a full page load offers a ton more tracking opportunities that are not present in a direct endpoint fetch. I mean, look at how many "stealth" plugins are out there, designed to mask the fact that a headless browser is headless.

But, having said all of that: without question the biggest risk to modern-day scraping is Cloudflare and Akamai gatekeeping. I do appreciate the arguments of "but ddos!11", and yet I would rather that only actors actually exhibiting bad behavior[1] be blocked, instead of everyone with a copy of python who has set reasonable rate limits.

1 = this setting aside that "bad behavior" can be defined as "downloading data that the site makes freely available to Chrome but not freely available to python"



