

Using jQuery and node.js to scrape html pages in 5 lines - Ainab
http://blog.nodejitsu.com/jsdom-jquery-in-5-lines-on-nodejs

======
drats
Strange popup: "Hello, i see you are coming from hacker news.

the article you clicked on was most certainly not submitted by nodejitsu.

news.ycombinator has a long history of squashing articles and submitters that
aren't funded by y-comb.

most of this is done through their "silent" banning and censoring mechanisms,
that leave people not even realizing they have been silenced.

i hope you enjoy this article, and remember that HN is extremely biased and
that you should keep your horizons broad."

While I would agree that HN is biased towards YC-funded projects, I would not
agree that it is biased against non-YC projects or news. In fact, the majority
of the items on HN are non-YC. The same holds for submitters and commenters
over the year or more I've been here.

On a different note: Hpricot is not representative of Ruby scraping anymore -
nokogiri (<http://nokogiri.org/>) is where it's at, and it has a Hpricot
translation layer if you need to switch. Even though I decided to settle on
Python for everything else, I still go back to Ruby just for nokogiri when
it comes to scraping.

~~~
shadowsun7
Marak, the guy behind Nodejitsu (and, presumably, the popup message), is
known to exhibit asshole behaviour. (See:
<http://news.ycombinator.com/item?id=1448309>) The popup message is consistent
with what HN knows of him.

Whether being a jerk justifies banning I can't say - but his assertion that HN
is biased has little justification (particularly when you consider that the
writer _himself_ is biased.) Kindly ignore.

~~~
thaumaturgy
He seems to be about the same sort of asshole as a lot of other internet
personalities (Theo comes to mind). i.e., he's good at what he does,
knowledgeable, and he doesn't take the time to play nice with others.

It kinda comes down to what sort of "thing" HN is supposed to be -- whether it
has a place for people like him -- but regardless, he's certainly not exactly
hurting over his inability to play in our little sandbox here.

~~~
supporting
But he's actually not knowledgeable or good at what he does -- if you view his
Github page, you'll see that his projects are 100-line thefts or wrappers
around other people's work.

He's been a major drain/drag on the Node.js community, and makes the IRC
channel a toxic wasteland for a good part of the day.

He trolls other sites as well... <http://www.youtube.com/watch?v=IrkDqh9ZVog>

~~~
mnutt
Say what you will about his interactions with people, but "not knowledgeable
or good at what he does" is an unfair characterization.

I find his hook.io project (<http://hook.io/about.html>) particularly
interesting, and the BDD testing app he was building for Node Knockout has a
lot of potential as well.

------
robinduckett
Hey guys. The Nodejitsu team and Marak (<http://www.github.com/Marak>), the
guy behind Nodejitsu, are perma-banned from HN and can't respond to your
queries.

He sends his regards, and if you'd like to contact him, visit the #Node.js IRC
channel on Freenode.

------
il
I have a question: Does scraping like this execute Javascript on the scraped
page? Am I able to access the output of Javascript/AJAX on that page?

As far as I know this is impossible with any other server-side scraping
technology.

If so, that would be amazingly useful for a couple of my side projects, much
easier than parsing their Javascript code and extracting the info I need.

~~~
robinduckett
You'd have to parse the page separately and run each inline or linked script
in a sandbox that can talk to jsdom, but it could be done.
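
To make the first half of that concrete, here is a rough sketch of only the "parse the page separately" step: pulling the inline and linked scripts out of a page so they could later be handed to a sandbox. It uses Python's standard-library `html.parser` rather than jsdom, it does not execute any JavaScript, and the `ScriptCollector` and `collect_scripts` names and the sample page are made up for illustration.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect inline script bodies and linked script URLs (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.inline_scripts = []  # source text of <script>...</script> blocks
        self.linked_scripts = []  # values of <script src="..."> attributes
        self._in_script = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        src = dict(attrs).get("src")
        if src:
            # Linked script: it would have to be fetched separately before running.
            self.linked_scripts.append(src)
        else:
            self._in_script = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_script:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_script:
            self.inline_scripts.append("".join(self._buffer).strip())
            self._in_script = False


def collect_scripts(html):
    """Return (inline_scripts, linked_scripts) found in the given HTML string."""
    parser = ScriptCollector()
    parser.feed(html)
    parser.close()
    return parser.inline_scripts, parser.linked_scripts


# Made-up sample page, only for demonstration.
page = """
<html><head>
<script src="/static/app.js"></script>
<script>document.title = "loaded";</script>
</head><body><p>Hello</p></body></html>
"""
inline, linked = collect_scripts(page)
print(inline)
print(linked)
```

Running the collected scripts against a DOM is the hard part and is not shown here; that would need a JavaScript engine plus a DOM implementation such as jsdom.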

~~~
_delirium
Sounds like a useful general-purpose library someone could put together:
server-side execution of HTML-embedded js, with some sort of configurable
sandbox (e.g. decide whether you want to let it call out to the internet or
not). All the components are available, but seems like the field is open for
an all-in-one solution that gets the defaults and edge cases right.

I imagine Google has something at least close to that internally, given that
they've dropped hints here and there about their ability to crawl post-js-
processed pages, but unless I've missed it, I don't think they've released
anything (my guess is because of the arms-race issue, with shady sites trying
to use js to mask certain content from googlebot).

~~~
il
Second this idea. Nothing like this exists currently except for a couple of
buggy, poorly documented, dead projects.

Selenium might be OK for usability testing a single site, but it's useless for
large-scale multithreaded crawling and scraping.

I, for one, would definitely pay good money for an easy library/API to scrape
JS-heavy sites without the overhead of a full browser running macros.

------
fmw
The article lists BeautifulSoup as the Python choice for scraping, but that
isn't necessarily true. I'm using <http://scrapy.org/>, for example, which is
a scraping framework that uses lxml and xpath by default.
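
For anyone who hasn't seen XPath-style extraction, here is a small sketch of the idea. It is not Scrapy's or lxml's API: it uses the limited XPath subset in Python's standard-library `xml.etree.ElementTree`, and the markup, the `story` class and the `story_links` helper are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed XHTML snippet. Real-world HTML is usually messier
# and needs a forgiving parser (lxml, for instance) before XPath is usable.
doc = """
<html>
  <body>
    <ul id="stories">
      <li class="story"><a href="http://example.com/a">First story</a></li>
      <li class="story"><a href="http://example.com/b">Second story</a></li>
      <li class="ad"><a href="http://example.com/ad">Sponsored</a></li>
    </ul>
  </body>
</html>
"""

root = ET.fromstring(doc)


def story_links(root):
    """Return (text, href) pairs for links inside <li class="story"> items."""
    # ElementTree supports descendant search (.//), attribute predicates
    # ([@class='story']) and child steps (/a), which is enough for this case.
    return [(a.text, a.get("href")) for a in root.findall(".//li[@class='story']/a")]


print(story_links(root))
```

The point is that a single path expression replaces a hand-written tree walk; a full XPath engine such as the one in lxml supports far more than this subset.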

~~~
bmelton
I don't know if it was edited after your post, but it lists Scrapy right next
to Beautiful Soup.

------
fizx
This reminds me, I ported the core ideas of the parsley scraping language to
jQuery.

<http://github.com/fizx/pquery#readme>

------
knowtheory
The article reads "The challenge with using these libraries is that they all
have their own quirks that can make working with HTML, CSS and Javascript
challenging."

And that's true only if you only want to do page manipulation in Javascript.
I'm perfectly happy with my page manipulation in Ruby w/ Nokogiri. Here's an
example:

(code formatting on HN sucks, so it's on my blog, apologies)

<http://blog.knowtheory.net/post/1074676060/xml-manipulation-in-6-lines-of-ruby>

------
tcarnell
Interesting. When I built <http://cQuery.com> (Content Query Engine), I
investigated a number of HTML parsing and content extraction options.
I had played with Rhino and John Resig's env.js
(<http://ejohn.org/blog/bringing-the-browser-to-the-server/>) to run jQuery
server-side.

For portability, performance and flexibility I finally settled on writing my
own HTML parser and CSS selection engine from scratch.
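
As a rough idea of what a from-scratch selection engine involves, here is a toy sketch in Python. It is not cQuery's code: it leans on the standard-library `html.parser` for tokenizing, assumes reasonably well-formed HTML, and supports only tag, `.class` and `#id` selectors joined by the descendant combinator. `Node`, `TreeBuilder`, `matches_simple` and `select` are all hypothetical names.

```python
import re
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class Node:
    def __init__(self, tag, attrs, parent=None):
        self.tag = tag
        self.attrs = attrs
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Build a simple element tree; assumes reasonably well-formed HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document", {})
        self._current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, dict(attrs), self._current)
        self._current.children.append(node)
        if tag not in VOID_TAGS:
            self._current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag, if there is one.
        node = self._current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self._current = node.parent

    def handle_data(self, data):
        self._current.text += data


def matches_simple(node, simple):
    """Match one compound selector such as 'a', '.item', 'li.item' or 'div#main'."""
    tag = re.match(r"[a-zA-Z][\w-]*", simple)
    if tag and node.tag != tag.group(0).lower():
        return False
    classes = set((node.attrs.get("class") or "").split())
    for kind, name in re.findall(r"([.#])([\w-]+)", simple):
        if kind == "." and name not in classes:
            return False
        if kind == "#" and node.attrs.get("id") != name:
            return False
    return True


def select(root, selector):
    """Return nodes matching a descendant selector like 'ul.menu li a', in document order."""
    parts = selector.split()

    def ancestors_match(node, remaining):
        # Each remaining compound selector must match some ancestor, right to left.
        if not remaining:
            return True
        parent = node.parent
        while parent is not None:
            if matches_simple(parent, remaining[-1]) and ancestors_match(parent, remaining[:-1]):
                return True
            parent = parent.parent
        return False

    results = []

    def walk(node):
        for child in node.children:
            if matches_simple(child, parts[-1]) and ancestors_match(child, parts[:-1]):
                results.append(child)
            walk(child)

    walk(root)
    return results


# Made-up markup, only for demonstration.
html = """
<div id="main">
  <ul class="menu">
    <li class="item active"><a href="/home">Home</a></li>
    <li class="item"><a href="/about">About</a></li>
  </ul>
  <a href="/outside">Outside</a>
</div>
"""
builder = TreeBuilder()
builder.feed(html)
builder.close()
links = select(builder.root, "ul.menu li a")
print([a.attrs["href"] for a in links])
```

A real engine would also need child and sibling combinators, attribute selectors, pseudo-classes and recovery from broken markup, which is where most of the effort goes.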

------
forsaken
Site appears down. Is node popular enough yet for the "Node doesn't scale"
talk? :)

~~~
drats
Yes, as cliched as it is, I think it's time. I couldn't use at least 6/10 of
the node challenge top 10 when it hit the HN front page (and the rest were
beset by bugs and didn't work - the pixel one where you form characters
stopped showing the shape I was supposed to be trying to get into after a few
rounds, and the robot war one never let me buy or release my wave of robots on
Chrome or Firefox). Overall it was a totally disappointing experience.

~~~
mnutt
I think there are a couple of factors here:

1\. It was a 48-hour coding competition, so some bugs are to be expected.

2\. The hosting for the apps in this competition was turned on literally the
day before: <http://twitter.com/joyent/status/22204412477>

I don't think that "node can't scale" will get much press, just because node
is pretty performant and many people conflate performance and scalability.
Node's asynchronous api would seem pretty good in a services-oriented
architecture, and scalability is mostly about how the app is architected. (and
in 48 hours, that probably means "quickly")

On the other hand, node.js is still a very young platform and there are a lot
of unsolved issues. As of right now there are only 3 or 4 dedicated node.js
hosts that I know of, and all are in private beta. People are still trying to
figure out best practices for design and testing.

The thing that I took away from Node Knockout was that there are a bunch of
really cool web app ideas that, given time, could be built with node. I think
the apps were 'frozen' pending judging, but I know many of the top app teams
were planning on improving them as soon as they had the chance.

------
jfager
Ignoring the drama: my current favorite scraping combo is NekoHtml underneath
Scala's completely kickass combo of pattern matching and XML literals.

