The issue, as the above thread shows, is that the creator of node-htmlparser does not want to support invalid HTML. Unfortunately, that is not realistic.
It is a scraping/crawling tool suite. Base it on WebKit, with good scriptable plugin support (not just JS, but exposing the DOM to other languages too). It would consist of a few main parts.
3) The spidering engine determines which pages to point the scraping engine at. It can be fed by external code, or by scripts from the scraping engine as a feedback mechanism (some links on a page may be part of a single scraping job, while others are just fodder for a later one). It can be thought of as a work queue for the scraping engines.
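A minimal sketch of that feedback loop in Node.js, assuming a hypothetical scrape() function standing in for the scraping engine (nothing here is an existing API):

    // A naive in-process work queue: the spidering engine feeds URLs to
    // the scraping engine, and scrape scripts feed discovered links back.
    var queue = [];
    var seen = {};

    // scrape() is a stand-in for the scraping engine; this stub "finds"
    // no data and no links.
    function scrape(url, cb) {
      cb(null, { data: {}, links: [] });
    }

    function enqueue(url) {
      if (!seen[url]) {          // avoid re-scraping the same page
        seen[url] = true;
        queue.push(url);
      }
    }

    function run() {
      var url = queue.shift();
      if (!url) return;          // queue drained
      scrape(url, function (err, result) {
        if (!err) {
          result.links.forEach(enqueue);  // the feedback mechanism
        }
        run();                   // move on to the next queued page
      });
    }

    enqueue('http://example.com/');
    run();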
The main use cases I see for this are specialized search and aggregation engines that want to get at the interesting bits of sites which don't expose a good API, or where the data may be formatted but is hard to infer semantically without human intervention. Sure, it wouldn't be as efficient from a code-execution point of view as, say, custom scraping scripts, but it would allow much faster response times to page changes and better use of programmer time, by taking care of the boilerplate or almost-boilerplate parts of scraping scenarios.
It's true that this process won't render an AJAX-heavy page the way your browser will, but I've found that if you inspect the page to determine the address and parameters of the backend scripts, you often don't have to pull HTML at all. You just hit the scripts directly and feed them parameters (or use Mechanize, if cookie/state tracking is involved).
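For instance, a minimal sketch of hitting such a backend script with the request library; the endpoint and parameters below are made up for illustration, so read the real ones out of your browser's network inspector:

    var request = require('request');

    request({
      url: 'http://example.com/api/search',  // backend script the page calls (assumed)
      qs: { q: 'widgets', page: 1 },         // parameters observed in the inspector
      json: true                             // many such endpoints return JSON
    }, function (err, res, body) {
      if (err) throw err;
      console.log(body);  // structured data, no HTML parsing needed
    });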
Perfectly fine use of CSS selectors and more.
Isn't PhantomJS already perfect for scraping? What exactly is the advantage of this?
And of course you can scrape interactive sites; interactive sites still basically just use HTTP to request data. Just watch the Network panel in Chrome's developer tools and figure out which of the site's HTTP requests you are interested in, then code them into request() calls.
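As a sketch, suppose the Network panel shows the site POSTing JSON to an /api/items endpoint; the URL and payload here are assumptions for illustration:

    var request = require('request');

    request.post({
      url: 'http://example.com/api/items',      // endpoint seen in the Network panel (assumed)
      json: { category: 'books', offset: 0 }    // body copied from the observed request
    }, function (err, res, body) {
      if (err) throw err;
      console.log(body);  // same data the page's own JS receives
    });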
Also worth noting that jsdom is quite slow and has a strict HTML parser. If you want something faster that will cope with more real-world web pages, look at cheerio.
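A typical cheerio snippet, for reference (the HTML and selector are illustrative):

    var cheerio = require('cheerio');

    // cheerio parses forgiving, real-world HTML and exposes a jQuery-like API
    var $ = cheerio.load('<ul><li class="item">one</li><li class="item">two</li></ul>');

    $('li.item').each(function () {
      console.log($(this).text());  // "one", then "two"
    });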
A browser's sole purpose is to run web apps. It doesn't make sense to use anything else.
With a pure Node.js scraper we can run over 1,000 parallel sessions per CPU.
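A rough sketch of why that scales: a "session" is just a pending HTTP request rather than a full browser instance. The URL list and the concurrency cap below are arbitrary stand-ins:

    var request = require('request');

    var urls = [];  // imagine thousands of URLs here
    for (var i = 0; i < 1000; i++) urls.push('http://example.com/page/' + i);

    var active = 0, next = 0, LIMIT = 1000;  // cap on in-flight requests

    function pump() {
      while (active < LIMIT && next < urls.length) {
        active++;
        request(urls[next++], function (err, res, body) {
          active--;  // each finished request frees a slot
          pump();    // keep the pipeline full
        });
      }
    }

    pump();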