

HTML/XML Parsing with Node & jQuery - mjijackson
http://alexmaccaw.co.uk/posts/node_jquery_xml_parsing

======
pshc
I was scraping with jQuery for a while but it felt like an awful lot of
overhead. For simpler scraping tasks that happen a lot, I've gone back to
nuts and bolts: the html5 module's [1] tokenizer and a custom state machine
that only accumulates the data I want. At no time is any DOM node actually
created in memory, let alone the entire DOM tree. That means I feel safer
running many of these in parallel on a VPS. It also means I can write a nice
streaming API that starts emitting data the moment it has enough input.
Buffering the whole input just feels wrong in node.js.

But jQuery is a great scraper if your transformation is complex and non-
streamable. [1] <https://github.com/aredridel/html5>

------
ricardobeat

        doc.find('h2:gt(0)').before('<hr />')

------
peteretep
Actually, I'm doing this for my SUPER SECRET startup at the moment.
Originally the front-end would just send the back-end the whole HTML of a
user's page when they executed the browser plugin, and the back-end would
intercept it and knock it up in Perl.

Wasn't sure how well that was going to scale, and was worried people would
get weird about sending the entire contents of the page they're on. I have a
90% working solution now where it's all done in-browser, with a bunch of
classes I've been working on alongside a node.js set of testing tools.

------
bialecki
One of my biggest pet peeves with crawling the web is using XPath. Not
because I have strong feelings about XPath; it's just that I use CSS
selector syntax so much that it's a pain not to be able to leverage that
knowledge in this domain as well. Something like this is really awesome and
is going to make crawling the web more accessible.
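
The two notations express the same queries, which is exactly why libraries
can translate one into the other. As a toy illustration (the function name
`cssToXPath` is hypothetical, and it covers only tag, `#id`, `.class`, and
descendant tokens, nothing like the full grammar that real translators
handle):

```javascript
// Toy CSS-selector-to-XPath translator. Splits the selector on
// whitespace (descendant combinator) and turns each simple token into
// an XPath step with the matching predicates.
function cssToXPath(selector) {
  return '//' + selector.trim().split(/\s+/).map(function (token) {
    var m = token.match(/^([a-zA-Z0-9]*)(?:#([\w-]+))?(?:\.([\w-]+))?$/);
    var tag = m[1] || '*';              // bare #id or .class matches any tag
    var predicates = '';
    if (m[2]) predicates += '[@id="' + m[2] + '"]';
    if (m[3]) {
      // class attributes are space-separated lists, so test for the
      // padded class name rather than a simple equality.
      predicates += '[contains(concat(" ", normalize-space(@class), " "), " ' +
                    m[3] + ' ")]';
    }
    return tag + predicates;
  }).join('//');
}
```

For example, `cssToXPath('div#main a')` yields `//div[@id="main"]//a`.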

~~~
cosmic_shame
If you're using Python, lxml has a cssselect module that makes this a breeze.

~~~
bialecki
Very interesting, I'll definitely look into that. Thanks!

------
orc
Wow, I was just thinking this morning how awesome it would be to make a
desktop app that could crawl websites with jQuery. And since node.js has a
Windows installer, it sounds like a much better solution than the C#
HtmlAgilityPack I've been using.

~~~
orc
Hm.. I tried doing this on Windows but it turned out to be a lot of work to
get it set up correctly. npm is hard to install on Windows, and the jquery
package depends on contextify, which ships a native binary. It does have a
Windows build though: <https://github.com/Benvie/contextify>

------
slashclee
Apparently node.js doesn't provide the DOMParser object, which means that
you can't actually use jQuery's parseXML method. That's a bummer :(

