

Show HN: Scraperjs – A versatile web scraper - ruipgil
https://github.com/ruipgil/scraperjs

======
brianzelip
It's unclear to me how to actually run this. Only executing the two commands
listed under the Installing section does not run it - I had to `cd` into the
scraperjs dir, then `npm install`, then continue with the second Install
command (`grunt test`) to actually test.

Also, do you install scraperjs into each project directory you want to use it
for? Or just install it once?

~~~
ruipgil
Scraperjs is supposed to be used as an npm package. So, if you do "npm install
<package-name>", you download the latest version of the package to the same
folder as the closest package.json file (if there's none it will go to your ~/
folder). At that point you can just use it with "require('scraperjs')". The
test part is a bit more foggy, and I'll add more information to the README in
due time. To run the tests you've got to npm-install with the save-dev flag
(npm install --save-dev scraperjs); that also adds the package to your
development dependencies. This is so that people who just want to use the
package won't need to download all of scraperjs' development dependencies.

For more information about npm install: [https://www.npmjs.org/doc/cli/npm-
install.html](https://www.npmjs.org/doc/cli/npm-install.html)
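Concretely, the two modes described above look something like this (a sketch;
assumes you're inside a project folder that already has a package.json):

```shell
# Install as a runtime dependency (enough to require('scraperjs')):
npm install scraperjs

# Install for development/testing, recording it under "devDependencies":
npm install --save-dev scraperjs
```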

------
jasode
It would be helpful if the documentation compared how Scraperjs is different
from, or better than, CasperJS for scraping. CasperJS is the older and more
well-known wrapper around PhantomJS so comparisons would help people decide
what the appropriate tool would be.

[http://casperjs.org/](http://casperjs.org/)

~~~
thibauts
As far as I can see it doesn't have much in common with casperjs, apart from
the fact that it can use phantomjs.

~~~
jasode
If you mean that the syntax is different, yes, I get that.

CasperJS can also scrape dynamic websites. What criteria would someone want to
use ScraperJS instead of CasperJS for that task? Are there features in
ScraperJS that don't exist in CasperJS? Does it take 10x less lines-of-code to
accomplish the same task? Etc.

~~~
ruipgil
For web scraping purposes ScraperJS and CasperJS would probably take about the
same number of lines of code; however, ScraperJS has most of the tools you
need for web scraping, something CasperJS lacks (that's not its main goal).
ScraperJS is also more flexible: if you want static content, just use the
static scraper and get lightning-fast results. TL;DR: CasperJS is great, but
it's not made for web scraping.

------
halcyondaze
If you're interested in scraping in python, then I recommend giving this a
read: [http://jakeaustwick.me/python-web-scraping-
resource/](http://jakeaustwick.me/python-web-scraping-resource/)

~~~
cridenour
I think Scrapy is a better Python scraping tool.

------
justboxing
This is awesome. I am very new to scraping, so bear with me if this is very
obvious.

Would it be possible to follow a list of URLs from a home page (Ex: List of
Marathon Runners), and then follow the link in their name that goes to their
stats page, and download / save the scraped data as JSON to a text file on the
local machine's C:\Runners\Data\ folder for example?

Also, does anyone know of a reliable and tested C# / .Net / ASP.Net web page
scraper?

~~~
cwbrandsma
On the second question: typically a web scraper just interacts with the output
of a web server, so it shouldn't matter whether that's asp.net or any other
system.

~~~
misterbwong
Mostly this. In ASP.NET/C# you're probably looking at using the built-in
HttpClient lib [0] and an HTML parser lib like HtmlAgilityPack [1]. I've used
this combo in the past and am happy with it.

[0] [http://msdn.microsoft.com/en-
us/library/system.net.http.http...](http://msdn.microsoft.com/en-
us/library/system.net.http.httpclient\(v=vs.118\).aspx) [1]
[http://htmlagilitypack.codeplex.com/](http://htmlagilitypack.codeplex.com/)

~~~
justboxing
Thanks!!

------
jdrock
Let us know if you'd like to integrate this with
[http://www.80legs.com](http://www.80legs.com)!

------
andrejewski
If anyone is interested in just scraping links between webpages with
JavaScript, I made Slinky
([https://github.com/andrejewski/slinky](https://github.com/andrejewski/slinky)).
The API is simple and easily overridable.

------
pibefision
Could someone recommend a similar framework, but Ruby-based? Just because I'm
more skilled in Ruby than in Node (not for trolling purposes).

I've been exploring GitHub but could not find a well-maintained framework (or
at least one updated in the last month).

~~~
findjashua
While I haven't tried any, I think if you want to handle dynamic Javascript
content, you'd have to go with a JS library. Feel free to correct me if I'm
wrong.

~~~
riffraff
You can do it in pure Ruby with one of the WebKit wrappers (e.g.
poltergeist [0]).

[0]
[https://github.com/teampoltergeist/poltergeist](https://github.com/teampoltergeist/poltergeist)

------
jwarren
Nice! Could've used that this weekend when I got caught in callback hell
trying to build a simple NodeJS scraper. Ended up doing it in PHP just because
I know it well.

I'll give it another go with this library next week!
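(For what it's worth, the nesting can usually be flattened with promises even
without a scraping library; a minimal sketch where fetchPage and extractLinks
are stand-ins, not real scraperjs calls:)

```javascript
// fetchPage stands in for an HTTP request and resolves with fake HTML.
function fetchPage(url) {
  return Promise.resolve('<a href="/a">a</a> <a href="/b">b</a>');
}

// Naive href extraction, good enough for the sketch.
function extractLinks(html) {
  return (html.match(/href="[^"]+"/g) || []).map(function (m) {
    return m.slice(6, -1); // strip 'href="' and the closing quote
  });
}

// Each step returns a value to the next .then(), instead of nesting
// the next callback inside the previous one.
fetchPage('http://example.com')
  .then(extractLinks)
  .then(function (links) {
    console.log(links); // [ '/a', '/b' ]
  })
  .catch(console.error);
```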

------
roux_rc
Artoo is soooo much better :)
[https://medialab.github.io/artoo/](https://medialab.github.io/artoo/)

------
bshimmin
I really like the router aspect of this. That's a nice idea and not (to the
best of my limited memory) one I can recall seeing in any other scraper.

------
mr5iff
I don't quite get the point of the DynamicScraper... Any real use cases for
that?

~~~
jasode
For example, go to [http://www.imdb.com](http://www.imdb.com)

On the right, you'll notice that under the sidebar "Opening This Week" is a
movie titled _Love Is Strange_.

With that in mind, press Ctrl+U (view html source).

Try to search for the word _Strange_ anywhere in the source. (It's not
there.) If it's not there, how did it get shown on the screen?!

The answer is that it is "dynamically" loaded. A simple scraper that only
works on a static download of html source won't be able to retrieve that
string. You need web scrapers that can process dynamic pages (execute
Javascript).

Btw, you'll notice that you _can_ find the string "Strange" via F12 (Developer
Tools). That's because the F12 inspector shows the html _after_ the DOM has
been dynamically modified by javascript whereas Ctrl+U does not.
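(A toy version of the above, with made-up page content, shows why the static
view misses the string:)

```javascript
// What a static scraper downloads: the title isn't in the source at all,
// because a script fetches it later and injects it into the DOM.
var staticHtml = '<div id="openings"></div>' +
                 '<script src="/load-openings.js"></script>';

// What that script would fetch after page load (made-up payload):
var xhrResponse = { openings: ['Love Is Strange'] };

console.log(staticHtml.indexOf('Strange') !== -1); // false: static scrape misses it

// A dynamic scraper executes the script first and then queries the rendered
// DOM, which by that point contains xhrResponse.openings[0].
```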

------
novaleaf
if you want a scraper as a service, you can try:
[https://PhantomJsCloud.com](https://PhantomJsCloud.com)

disclaimer: i wrote it.

------
woah
Looks pretty good, shame about the promises.

