
Ask HN: Best way to scrape web content? - tmaly
What is your go-to library or framework for scraping web content?

I am looking to scrape content that may or may not have JavaScript in the page, and I am looking for something with good documentation that does not have a steep learning curve.
======
assafmo
Mostly Linux tools.

First of all, always respect the site's terms of use and the server's
capacity. It is tempting to just run parallel -P 200, but it can hurt the
website and can get you throttled or banned.

cURL. Chrome dev tools -> Network -> Copy as cURL. It is also very easy to
customize requests. Sessions can be handled with --cookie and --cookie-jar.
Using Tor (sudo apt install tor) with --socks5 localhost:9050 or
--socks5-hostname localhost:9050 is great for scraping .onion websites and
for anonymity. The holy grail for me is finding a JSON REST API for the
website I am scraping.
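A minimal sketch of the cookie-session and Tor flags (example.com, the login endpoint, and the form fields are placeholders, not a real site):

```shell
# A session in two requests: --cookie-jar saves cookies on the way out,
# --cookie sends them back on the next request.
curl --cookie-jar cookies.txt \
     --data 'user=me&pass=secret' \
     https://example.com/login

curl --cookie cookies.txt https://example.com/account

# The same request routed through a local Tor daemon; --socks5-hostname
# also resolves DNS through Tor, which is what you want for .onion sites.
curl --socks5-hostname localhost:9050 https://example.com/
```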

lynx --stdin --dump to extract text when there isn't anything interesting
inside the HTML/JS (man lynx! it has some great options).
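A quick offline demo of that lynx invocation (the HTML here is made up; normally you would pipe in curl's output):

```shell
# --stdin reads HTML from standard input, --dump writes the rendered
# text to standard output, tags stripped.
cat > page.html <<'EOF'
<html><body><h1>Hello</h1><p>Plain text, no tags.</p></body></html>
EOF

lynx --stdin --dump < page.html
```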

awk to extract data. It makes pulling data out of tables very easy! For HTML
I usually use awk -F '[<>]' or awk -F '"'... it depends on where the data
sits in the markup. (Sometimes also egrep -o to extract with a regex.)

seq 1 10 | parallel curl 'https://banana.papaya.com/?page={}' is great for
paging.

jq to process JSON. I usually output to CSV with @csv, because then I can use
sqlite for quick queries, or any other DB that has an easy "import from CSV"
option (I like using CouchDB with couchimport).
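The JSON-to-CSV-to-sqlite pipeline in miniature (the JSON shape and field names are invented for this sketch):

```shell
cat > items.json <<'EOF'
[{"name":"banana","price":1.2},{"name":"papaya","price":3.4}]
EOF

# Emit a header row plus one CSV row per object with @csv.
jq -r '(["name","price"] | @csv), (.[] | [.name, .price] | @csv)' \
  items.json > items.csv

# sqlite's CSV import creates the table from the header row. Imported
# columns are text, hence the CAST in the query.
sqlite3 :memory: <<'EOF'
.mode csv
.import items.csv items
SELECT name FROM items WHERE CAST(price AS REAL) > 2.0;
EOF
```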

Dynamic pages - you can always fetch the content with the same request the
site's JavaScript made. I never find myself needing to execute the page's
JavaScript.
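What that looks like in practice, as a sketch (the endpoint and field names below are invented): find the JSON request in Chrome dev tools -> Network (filter by XHR/Fetch), Copy as cURL, and replay it directly:

```shell
# Replay the page's own data request instead of executing its
# JavaScript, then pick fields out of the response with jq.
curl -s -H 'Accept: application/json' \
  'https://example.com/api/products?page=1' \
| jq -r '.products[].name'
```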

EDIT: typos, grammar

~~~
tmaly
Thanks for the Chrome dev tools tip, I did not even know they had that.

lynx brings back memories of when all I had was a green-screen VT100 terminal
in the university library.

------
btschaegg
For simple, scripted, one-off tasks, I've had good experiences with Beautiful
Soup for Python[1].

If the JavaScript actually needs to be executed, though, you're more likely
to need something like headless Chrome (recently on HN [2]) or PhantomJS [3],
though I can't speak to how easy they are to use.

[1]: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

[2]:

https://news.ycombinator.com/item?id=14101233

https://news.ycombinator.com/item?id=14239194

[3]: http://phantomjs.org/

~~~
tmaly
Thanks, I would love to hear more about the headless chrome if someone can
chime in.

------
iamsvera
If you know which requests to make to get the data you want, you don't need
to wait for a headless browser to execute the JavaScript that renders it.
It's pretty simple: you just need an HTTP client and something to parse the
content. If the content is HTML, I'd recommend cheerio in Node.js, which lets
you parse and traverse it with a jQuery-like API.

If you must scrape the content with a headless browser, I'd choose Selenium.

https://github.com/sourcegraph/go-selenium

~~~
iamsvera
There is a really cool and easy-to-use library for Node.js, but it doesn't
use a headless browser :( it just makes the requests and parses the content.

https://github.com/IonicaBizau/scrape-it

------
zoobab
I use Python Selenium with Firefox in headless mode.

It can execute the JavaScript, or skip it.

~~~
tmaly
Do you have an example of any blog posts on how to get it setup?

~~~
zoobab
I will try to post something.

