

Using Node.js and jQuery to Crawl Public Tweets - BenjaminCoe
https://github.com/bcoe/birdeater

======
hafabnew
From the docs:

'''

* Node.js [...]

* jQuery [...]

[...]

This approach has become my hammer when web scraping tasks come up.

'''

If all you have is a hammer, you may find yourself noticing that objects
become more nail-like :).

------
wskinner
I have also found node+jQuery an effective web crawling combination. In
particular, the cheerio library <https://github.com/MatthewMueller/cheerio>
greatly simplifies data extraction. And as others have mentioned, the
asynchronous nature of node is perfectly suited to crawling (as long as you
take care not to accidentally DDoS the target site).
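
One way to avoid hammering the target site is to cap how many requests are in
flight at once. A minimal sketch in plain Node, assuming a hypothetical async
`worker` function that fetches and parses one page (the names `mapWithLimit`
and `worker` are made up for illustration and belong to no library):

```javascript
// Run `worker` over `items` with at most `limit` calls in flight at once.
// `worker` is a hypothetical async function, e.g. one that fetches a page.
async function mapWithLimit(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;

  // Each runner claims the next unprocessed index until none remain.
  async function runner() {
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i], i);
    }
  }

  // Start no more runners than there are items.
  const runners = [];
  for (let k = 0; k < Math.min(limit, items.length); k++) {
    runners.push(runner());
  }
  await Promise.all(runners);
  return results;
}
```

Something like `mapWithLimit(urls, 2, fetchPage)` would then keep at most two
requests open at a time; adding a short delay inside `worker` makes it gentler
still. If any call rejects, the whole run rejects, so error handling is left
to the caller here.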

------
latchkey
If you really want to scrape pages, you should use something like
<https://github.com/chriso/node.io/> which batches work into jobs and helps
with error handling, I/O, etc.
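
To illustrate what batching with error handling buys you, here is a rough
sketch in plain JavaScript. It is not node.io's API; `chunk`, `runInBatches`
and `job` are hypothetical names, and `job` stands in for whatever scrapes a
single page:

```javascript
// Split `items` into consecutive batches of at most `size` elements.
function chunk(items, size) {
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Run `job` over every item, one batch at a time, recording failures
// instead of letting one bad page abort the whole run.
async function runInBatches(items, size, job) {
  const done = [];
  const failed = [];
  for (const batch of chunk(items, size)) {
    // Wrapping in a promise chain also captures synchronous throws from `job`.
    const settled = await Promise.allSettled(
      batch.map((item) => Promise.resolve().then(() => job(item)))
    );
    settled.forEach((result, i) => {
      if (result.status === 'fulfilled') {
        done.push({ item: batch[i], value: result.value });
      } else {
        failed.push({ item: batch[i], error: result.reason });
      }
    });
  }
  return { done, failed };
}
```

The `failed` list can be fed back into `runInBatches` for a retry pass, which
is roughly the kind of bookkeeping a job framework takes off your hands.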

------
blyxa
Why not use the Twitter API?

~~~
bdreadz
From the GitHub page: Birdeater does not use Twitter's API. It was built as a
demonstration of an approach I like to use for parsing structured information
from unstructured HTML.

~~~
TazeTSchnitzel
A better (and more practical) example is scraping an internet forum (I've
done it, partially).

------
danso
Does Node have anything like Mechanize? Handling cookie state and such is
something that is much more useful than the selector functionality of
jQuery...which is great, but not any better than what Nokogiri offers.
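
For a sense of what "handling cookie state" involves at its simplest, here is
a deliberately tiny sketch. `CookieJar` is a hypothetical class, not part of
Node or any library, and it ignores attributes such as Domain, Path, Expires
and Secure that a real Mechanize-style client has to honour:

```javascript
// A minimal cookie jar: it keeps name=value pairs only and ignores all
// cookie attributes (Domain, Path, Expires, Secure, ...).
class CookieJar {
  constructor() {
    this.cookies = new Map();
  }

  // Accepts the raw Set-Cookie header values from one response.
  store(setCookieHeaders) {
    for (const header of setCookieHeaders) {
      const pair = header.split(';')[0]; // drop the attributes
      const eq = pair.indexOf('=');
      if (eq < 1) continue; // skip entries with no name
      this.cookies.set(pair.slice(0, eq).trim(), pair.slice(eq + 1).trim());
    }
  }

  // Builds the Cookie header value to send with the next request.
  header() {
    return [...this.cookies].map(([k, v]) => `${k}=${v}`).join('; ');
  }
}
```

With Node's built-in http module, `res.headers['set-cookie']` is an array of
such strings, so you would call `jar.store(res.headers['set-cookie'] || [])`
after each response and send `jar.header()` as the `Cookie` header on the next
request.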

~~~
laughinghan
<http://zombie.labnotes.org/> is a library I've used with great success. The
documentation in particular is cute.

I found PhantomJS unnecessarily convoluted for trivial tasks and was unable to
figure out how to do the nontrivial thing I was actually trying to do. The
documentation in particular was unusable.

