
Osmosis: Web scraper for Node.js - tombenner
https://github.com/rc0x03/node-osmosis
======
watson
I'm puzzled why the author highlights "Lightweight: no dependencies" as a
strength - in Node.js land I don't see that as much of a selling point, in my
opinion. I'm happy to hear people's views on this.

~~~
risyasin
Well, you are right about Node.js land and what being lightweight means in
that land. Frankly, it looks like hell when you have to deal not only with
packages but also with their semantic versions. But as the owner of another
Node.js crawler library (Arachnod: Web Crawler for Node.js
[https://www.npmjs.com/package/arachnod](https://www.npmjs.com/package/arachnod)),
I can assure you the developer has a point about his package being lightweight
and dependency-free. When I started to write a crawler with Node.js I had to
deal with many problems (I believe other common languages may involve fewer).

Also, I haven't tried it at scale, for example on more than a million web
pages, but "memory leak free" is a really strong claim which has to be tested
first.

~~~
watson
I'm interested to hear what you think is problematic about dependencies when
building a web crawler. Is it specific to the DOM parsing?

~~~
fapjacks
I've toyed with web scraping in Node, and the answer is definitely parsing.

------
matthewmueller
If you need request delays, executing JS on the page, pagination or deep
object schemas, you may also consider x-ray:
[https://github.com/lapwinglabs/x-ray](https://github.com/lapwinglabs/x-ray)
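
A rough sketch of what pagination and a nested schema look like with it (the
URL and selectors here are made up, so treat it as illustrative only):

    var Xray = require('x-ray');
    var x = Xray();

    // Collect one object per post and follow the "next page" link,
    // stopping after three pages.
    x('http://example.com/blog', '.post', [{
      title: 'h2 a',
      link: 'h2 a@href',
      tags: ['.tags li']
    }])
      .paginate('.next a@href')
      .limit(3)
      .write('posts.json');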

------
lintuxvi
Can this run js live on the page?

~~~
arielm
There are other ways to run live JS on a page. It just has to do with how you
load the pages. If it's just an HTTP request to get the body, it won't work,
but using a headless browser will do the trick just fine, and without too much
async headache.
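
A minimal sketch of the headless-browser route (this one uses Puppeteer, but
any headless browser works; the URL and selector are placeholders):

    const puppeteer = require('puppeteer');

    (async () => {
      const browser = await puppeteer.launch();
      const page = await browser.newPage();
      // The page's own JS runs as part of loading, so the rendered DOM is available.
      await page.goto('http://example.com', { waitUntil: 'networkidle0' });
      const title = await page.$eval('h1', el => el.textContent);
      console.log(title);
      await browser.close();
    })();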

------
dijs
I wrote something similar
[https://github.com/dijs/parsz](https://github.com/dijs/parsz)

------
jlas
Scraping in Node.js is just not worth it. IMHO the asynchronicity really gets
in the way of building a scraper.

~~~
watson
I don't know about libxml, but normally you wouldn't need to do any I/O once
you've got hold of the raw HTML - so there should be no need for callbacks.
E.g. with cheerio you can parse HTML synchronously.
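
Something like this (a sketch; the URL and selector are placeholders) - only
the fetch is asynchronous, the parsing itself is synchronous:

    const https = require('https');
    const cheerio = require('cheerio');

    https.get('https://example.com', (res) => {
      let html = '';
      res.on('data', (chunk) => { html += chunk; });
      res.on('end', () => {
        // Once the raw HTML is in hand, parsing is plain synchronous code.
        const $ = cheerio.load(html);
        $('a').each((i, el) => {
          console.log($(el).attr('href'));
        });
      });
    });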

------
_RPM
"Fast: uses libxml C bindings" Where exactly are these located?

~~~
watson
He depends on the module named libxmljs, which contains the C code:
[https://github.com/polotek/libxmljs](https://github.com/polotek/libxmljs)
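
Used directly it looks roughly like this (a sketch; the HTML and XPath are
just examples):

    const libxmljs = require('libxmljs');

    // Parsing is done by libxml2 in C; the input is HTML you've already fetched.
    const doc = libxmljs.parseHtml('<html><body><h1>Hello</h1></body></html>');
    const heading = doc.get('//h1'); // XPath query
    console.log(heading.text());     // => "Hello"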

------
curiousjorge
Why you would write a web scraper using asynchronous JavaScript beats me. What
is the gain?

~~~
richmarr
The ability to scrape/crawl pages with interactive elements without writing
tons of threading code, and without limiting your crawl rate to the number of
threads you can handle.
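
A sketch of what that buys you: many requests in flight from a single thread,
interleaved by the event loop rather than by threads (URLs are placeholders):

    const https = require('https');

    const urls = [
      'https://example.com/a',
      'https://example.com/b',
      'https://example.com/c'
    ];

    // All three requests are in flight at once; no threads involved.
    urls.forEach((url) => {
      https.get(url, (res) => {
        let body = '';
        res.on('data', (chunk) => { body += chunk; });
        res.on('end', () => console.log(url, body.length, 'bytes'));
      });
    });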

