
Ask HN: How to approach writing a web crawler that can handle JavaScript? - awclives
Greetings,<p>I am interested in writing a web crawler that can handle JavaScript, i.e. that can access the DOM after any JavaScript has run.<p>I recognize that this could get arbitrarily complicated; however, I wanted to know whether anyone had any obvious pointers. There do seem to be some nice Java--my preferred language--crawlers out there, e.g. https://github.com/yasserg/crawler4j. However, they of course do not handle JS.<p>Is the standard approach to handling JS to use something like Selenium, i.e. load the page in a browser and then pull the DOM into the crawler for processing?<p>Thanks.
======
T-A
I would probably use node-horseman [1], but then what do I know?

[1] [https://github.com/johntitus/node-horseman](https://github.com/johntitus/node-horseman)

~~~
awclives
I'll check it out. Thank you!

------
therealgimli
You want something like PhantomJS:

[http://phantomjs.org/](http://phantomjs.org/)

Other tools work well too, but PhantomJS will likely do everything you need.
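
Since you prefer Java, here is a rough sketch of the "render in a browser, then hand the DOM to the crawler" pattern asked about above. Every name in it (`RenderingCrawler`, `PageRenderer`, `renderedHtml`) is made up for illustration and is not part of crawler4j, Selenium, or PhantomJS. It assumes the browser sits behind a small interface and that links can be pulled out with a crude regex; a real crawler would use a proper HTML parser and respect robots.txt.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: a breadth-first crawler that delegates page loading to a browser. */
class RenderingCrawler {

    /** Assumed abstraction: returns the serialized DOM of a URL after its scripts have run. */
    interface PageRenderer {
        String renderedHtml(String url) throws Exception;
    }

    // Crude absolute-link extraction, for illustration only.
    private static final Pattern HREF = Pattern.compile("href=\"(http[^\"#]+)\"");

    private final PageRenderer renderer;

    RenderingCrawler(PageRenderer renderer) {
        this.renderer = renderer;
    }

    /** Crawls breadth-first from seed, visiting at most maxPages pages; returns them in visit order. */
    List<String> crawl(String seed, int maxPages) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> frontier = new ArrayDeque<>();
        List<String> visited = new ArrayList<>();
        frontier.add(seed);
        seen.add(seed);
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            String html;
            try {
                html = renderer.renderedHtml(url); // the browser does the JavaScript work
            } catch (Exception e) {
                continue; // skip pages that fail to render
            }
            visited.add(url);
            Matcher m = HREF.matcher(html);
            while (m.find()) {
                String link = m.group(1);
                if (seen.add(link)) { // enqueue each URL only once
                    frontier.add(link);
                }
            }
        }
        return visited;
    }
}
```

With Selenium, a `PageRenderer` could plausibly be a lambda that calls `driver.get(url)` and returns `driver.getPageSource()`, though how closely that source reflects the live DOM depends on the driver, so treat that as something to verify. Keeping the browser behind an interface also means PhantomJS or another headless browser can be swapped in without touching the crawl loop.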

