

Ask HN: What setup do you use to scrape web pages? - yalogin

I am writing code to fetch URL data and realized that, because of JavaScript, the regular requests module in Python doesn't cut it. I need to actually render the page in a browser engine and get the final page from there. Googling did not turn up a fixed/common way to do it. This must be something a good number of startups are already doing, so I wanted to pick some brains. Thanks for any suggestions.
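To make the problem concrete: here is a minimal, stdlib-only sketch (the sample page, the `$19.99` price, and the extractor are all invented for illustration). The raw HTML a plain HTTP fetch returns contains only an empty container and a script; the data the page *displays* never appears until a JS engine runs that script, which is why a browser engine is needed.

```python
from html.parser import HTMLParser

# Invented sample of what an HTTP client would actually receive from a
# JS-heavy page: the price exists only as a script side effect.
RAW_HTML = """<html><body>
<h1>Store</h1>
<div id="prices"></div>
<script>document.getElementById('prices').innerText = '$19.99';</script>
</body></html>"""

class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> tags."""
    def __init__(self):
        super().__init__()
        self.in_script = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if not self.in_script and data.strip():
            self.chunks.append(data.strip())

def visible_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)

print(visible_text(RAW_HTML))  # prints "Store" -- the JS-injected price is absent
```

The extractor sees only "Store"; `$19.99` is nowhere in the fetched source, so no amount of parsing the raw HTML will recover it.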
======
lazyfunctor
<http://nrabinowitz.github.io/pjscrape/> <http://phantomjs.org/>

If Python is not a strict requirement.
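Even if Python stays a requirement, PhantomJS can be driven from Python as an external process. A rough sketch under those assumptions (the `phantomjs` binary must be on PATH; the embedded script uses PhantomJS's documented `webpage` module to print the DOM after scripts have run):

```python
import subprocess

# PhantomJS-side script: load the URL passed as an argument, then dump
# the rendered page content once loading finishes.
RENDER_JS = r"""
var system = require('system');
var page = require('webpage').create();
page.open(system.args[1], function (status) {
    if (status === 'success') {
        console.log(page.content);  // DOM after JavaScript has executed
    }
    phantom.exit(status === 'success' ? 0 : 1);
});
"""

def build_render_command(script_path, url):
    """Command line that asks PhantomJS to render `url` with our script."""
    return ["phantomjs", script_path, url]

def render(url, script_path="render.js"):
    """Write the render script and return the rendered HTML.

    Requires a working `phantomjs` install; raises CalledProcessError on
    load failure.
    """
    with open(script_path, "w") as f:
        f.write(RENDER_JS)
    result = subprocess.run(build_render_command(script_path, url),
                            capture_output=True, text=True, check=True)
    return result.stdout
```

This keeps all the orchestration in Python while delegating the actual rendering to the headless WebKit engine.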

------
garlandbinns
Maybe give this a read if you haven't already:
[http://blog.databigbang.com/web-scraping-ajax-and-javascript-sites/](http://blog.databigbang.com/web-scraping-ajax-and-javascript-sites/)

Good luck.

~~~
yalogin
Thanks for the link. I saw that. I was hoping to find something using WebKit.
I found dryscrape but wasn't sure how popular it is. I was also trying to find
out what the standard practice is. Do people build it themselves to suit their
needs, so that they can optimize it for scale and performance?

The scraping need seems generic enough to me to warrant a library or tool
similar to dryscrape. The only problem is that dryscrape is Linux-only. I was
hoping to find a cross-platform solution so that I can develop on my own
machine before moving it to the Linux server.
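One way to deal with dryscrape being Linux-only is to keep the scraping logic behind a small renderer interface, so the backend can be swapped per platform: a stub (or any other engine) on the dev machine, dryscrape on the server. A minimal sketch of that idea (the `StubRenderer`, `scrape_title`, and the hypothetical `DryscrapeRenderer` names are all invented for the example; the dryscrape calls shown are its documented `Session` API):

```python
class StubRenderer:
    """Dev-machine stand-in: returns canned HTML instead of rendering."""
    def __init__(self, pages):
        self.pages = pages  # url -> html mapping

    def render(self, url):
        return self.pages.get(url, "")

def scrape_title(renderer, url):
    """Backend-agnostic scraping logic; only the renderer is platform-specific."""
    html = renderer.render(url)
    start = html.find("<title>")
    if start == -1:
        return None
    end = html.find("</title>", start)
    return html[start + len("<title>"):end]

# On the Linux server, the same logic could be fed by dryscrape instead
# (shown commented out since dryscrape only installs on Linux):
#
# class DryscrapeRenderer:
#     def __init__(self):
#         import dryscrape
#         self.session = dryscrape.Session()
#     def render(self, url):
#         self.session.visit(url)
#         return self.session.body()
```

With this split, everything above the `render()` call runs unchanged on any OS, and only the one platform-specific class moves with the deployment.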

