
Ask HN: How to Web Scrape in 2020? - alephnan
Are there particular libraries or scraping-as-a-service UIs you would recommend?

I'm particularly interested in a restaurant review website run by a company that has become increasingly detestable over the years.
======
marcell
scrapy for python is pretty good, check it out.

In most cases getting banned is the big issue. The bigger the site, the more
advanced its bot detection is. You can use luminato.io to get residential
and mobile IPs, but it's pricey.

Some sites will also obfuscate the DOM, e.g. by removing class names and IDs,
which complicates the data extraction.

[http://scrapinghub.com/](http://scrapinghub.com/) has a paid "do it for me"
service, which may be an option depending on your budget.

~~~
austincheney
Here is a tiny DOM-walking script that evaluates all text semantics in a page,
demonstrating that you don’t need identifiers in the code.

[https://github.com/prettydiff/semanticText](https://github.com/prettydiff/semanticText)
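
For a rough idea of what identifier-free extraction can look like, here is a
minimal sketch in Python using only the standard library. It is not the linked
script, just an illustration of the same idea: walk the tags and keep visible
text, never looking at class names or IDs. The sample markup, the
VisibleTextParser class and the visible_text helper are all made up for this
example.

```python
from html.parser import HTMLParser

# Tags whose text is never shown to a reader.
SKIP_TAGS = {"script", "style", "noscript", "template"}
# Tags that never have a closing tag, so they must not go on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta", "source", "wbr"}


class VisibleTextParser(HTMLParser):
    """Collects (enclosing tag, text) pairs; attributes are ignored entirely."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, innermost last
        self.chunks = []  # (tag, text) in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates unclosed inner tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and not any(t in SKIP_TAGS for t in self.stack):
            self.chunks.append((self.stack[-1] if self.stack else "", text))


def visible_text(html):
    """Return the visible text of an HTML string as (tag, text) pairs."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Hypothetical page with meaningless, obfuscated class names.
SAMPLE_PAGE = """
<div class="x9f2"><h1 class="a1">Luigi Trattoria</h1>
<p class="zz">Great pasta,<br>slow service.</p>
<script>var tracking = 1;</script></div>
"""
```

Calling visible_text(SAMPLE_PAGE) gives the heading and review text tagged
with h1 and p, so you can still tell a title from body copy by tag alone.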

------
mtmail
Related from 2 months ago: "Ask HN: What's state of the art for screen scraping
these days?"
[https://news.ycombinator.com/item?id=22148803](https://news.ycombinator.com/item?id=22148803)
where [https://simplescraper.io/](https://simplescraper.io/) was recommended

------
krageon
For avoiding bans, having a large IPv6 range can help (e.g. one you might
get with a VPS at a proper hosting company). As for grabbing the content
itself, I've used a lot of frameworks but I usually end up back at some
combination of simple string search and regex.
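
As a sketch of what the "string search and regex" approach can look like in
Python: do a cheap substring check first, then use a regex that keys on the
visible text rather than on class names. The sample markup, the patterns and
the extract_rating name are assumptions for illustration; a real page will
need its own patterns.

```python
import re

# Hypothetical markup; the class names are deliberately meaningless.
SAMPLE = (
    '<div class="q7"><span class="k2">Rating: 4.5 / 5</span>'
    '<span class="k3">128 reviews</span></div>'
)

# Patterns match the text a reader sees, not the surrounding markup.
RATING_RE = re.compile(r"Rating:\s*(\d+(?:\.\d+)?)\s*/\s*5")
COUNT_RE = re.compile(r"(\d[\d,]*)\s+reviews?\b")


def extract_rating(html):
    """Return (rating, review_count); either is None if not found."""
    rating = count = None
    # Plain substring test first, regex only when the marker is present.
    if "Rating:" in html:
        match = RATING_RE.search(html)
        if match:
            rating = float(match.group(1))
    match = COUNT_RE.search(html)
    if match:
        # Strip thousands separators such as "1,204".
        count = int(match.group(1).replace(",", ""))
    return rating, count
```

This breaks as soon as the site changes its wording, but it does not care
when class names or IDs get shuffled.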

------
jamil7
Depending on what you're scraping you might run into a fair few JS-only
websites that are a pain to scrape. On top of all the things mentioned here,
you will need to run those pages through a headless browser like puppeteer. For
these sites you may be able to reverse engineer their APIs and attempt to
scrape those rather than the pages themselves.
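
If you do find a JSON endpoint in the browser's network tab, scraping it can
be as simple as the sketch below. Everything specific here is hypothetical:
the example.com URL, the query parameter names, and the shape of the response
(a "reviews" list with author, rating and text fields) are assumptions, so
adapt them to whatever the real site returns.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical endpoint, for illustration only.
API_URL = "https://example.com/api/reviews"


def build_url(business_id, page, per_page=20):
    """Build a paginated request URL (parameter names are assumed)."""
    query = urlencode({"business": business_id, "page": page, "per_page": per_page})
    return f"{API_URL}?{query}"


def parse_reviews(body):
    """Pull the interesting fields out of a JSON response body."""
    data = json.loads(body)
    return [
        {
            "author": review.get("author"),
            "rating": review.get("rating"),
            "text": review.get("text", ""),
        }
        for review in data.get("reviews", [])
    ]


def fetch_reviews(business_id, page):
    """Fetch and parse one page; needs network access and a real endpoint."""
    request = Request(
        build_url(business_id, page),
        headers={"Accept": "application/json"},
    )
    with urlopen(request, timeout=10) as response:
        return parse_reviews(response.read().decode("utf-8"))
```

Keeping the parsing separate from the fetching means you can test it against
a saved response without hitting the site.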

------
ariosto
python + beautiful soup

