
Colly – Scraping Framework for Golang - tampo9
https://github.com/gocolly/colly
======
stablemap
Some discussion from two months ago:

[https://news.ycombinator.com/item?id=15408784](https://news.ycombinator.com/item?id=15408784)

~~~
dang
Thanks! Missed that earlier.

------
dguaraglia
I'm always surprised by how many web scraping frameworks/libraries I see
sprout here on HN on a regular basis. Is web scraping something people are
doing, or is web scraping the new high-concurrency version of the "to do list"
utility everyone used to write as an exercise?

This is an honest question, I'm not trying to take a dig at anyone in
particular.

~~~
CyberShadow
Just yesterday I wrote a program to scrape some Amazon search and product
pages [0].

Why? Because Amazon's search is outright broken. The number of results changes
when you change the sorting mode, and sometimes sorting by a different criterion
will just serve you a "no products found" error page.

I'll generally write a personal product-comparison program when it becomes
clear that I can't reliably find the best product by hand. Often,
even specialized websites that should have parametrized product
search/filtering don't have their data properly indexed, so you have to scrape
and parse it yourself. Another reason is to cross-reference with data from
other sources. E.g. what laptop can I buy that has the best single-threaded
CPU performance (within some other restrictions) [1]?

[0]: [https://github.com/CyberShadow/choose-product/blob/master/am...](https://github.com/CyberShadow/choose-product/blob/master/amazon/common.d)

[1]: [https://github.com/CyberShadow/choose-product/blob/master/le...](https://github.com/CyberShadow/choose-product/blob/master/lenovo/choose.d)

~~~
dguaraglia
Dammit, I see D code and I want to learn the language so badly. I just need to
find the right project for it (I might have something in the pipeline, as
there's some stuff I need to do with libav and I really hate the idea of using
C++.)

------
mlevental
>Lightning Fast and Elegant Scraping Framework for Gophers

the bottleneck in scraping is never the parsing/DOM representation/traversal.

~~~
tampo9
Good performance matters if you have decent networking infrastructure or your
server has limited resources.

Bandwidth and IP limits are the most common bottlenecks, but these can be
worked around using multiple proxies and SSH tunnels. Colly has built-in
support for switching proxies [1].

[1] [http://go-colly.org/docs/best_practices/distributed/](http://go-colly.org/docs/best_practices/distributed/)
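
For reference, a minimal round-robin setup along the lines of that doc looks
roughly like this (the proxy addresses are placeholders for your own proxies
or SSH tunnels):

    package main
    
    import (
        "log"
    
        "github.com/gocolly/colly"
        "github.com/gocolly/colly/proxy"
    )
    
    func main() {
        c := colly.NewCollector()
    
        // Rotate requests across several proxies (placeholder addresses).
        rp, err := proxy.RoundRobinProxySwitcher(
            "socks5://127.0.0.1:1337",
            "http://127.0.0.1:8080",
        )
        if err != nil {
            log.Fatal(err)
        }
        c.SetProxyFunc(rp)
    
        c.Visit("http://example.com/")
    }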

~~~
mlevental
the limitation is usually rate throttling on the page

>but these can be solved using multiple proxies and ssh tunnels. Colly has
built in support for switching proxies

interesting

>your server has limited resources.

possibly.

------
blowski
The obvious question - why would I use this over Scrapy?

~~~
sheraz
Hear, hear. The right tool for the right job. And I can't think of a "righter"
tool for this kind of job.

Edit - not picking on you, but given the quality and ecosystem of libraries
and ancillary tools for Scrapy, I don't even consider alternatives at this
point. Good on anyone who builds one to learn, but for actual workloads I
won't consider anything else.

------
fiatjaf
For DOM parsing I cannot imagine that there could be anything better than
[https://github.com/PuerkitoBio/goquery](https://github.com/PuerkitoBio/goquery).
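
To illustrate, a minimal fetch-and-select with goquery looks roughly like this
(the URL and selector are just placeholders):

    package main
    
    import (
        "fmt"
        "log"
        "net/http"
    
        "github.com/PuerkitoBio/goquery"
    )
    
    func main() {
        resp, err := http.Get("https://example.com/")
        if err != nil {
            log.Fatal(err)
        }
        defer resp.Body.Close()
    
        doc, err := goquery.NewDocumentFromReader(resp.Body)
        if err != nil {
            log.Fatal(err)
        }
    
        // jQuery-style selection over the parsed DOM.
        doc.Find("a[href]").Each(func(i int, s *goquery.Selection) {
            href, _ := s.Attr("href")
            fmt.Println(i, href, s.Text())
        })
    }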

~~~
jjuel
Which is funny, because if you look at the code, this is using goquery. Which
then makes you wonder: why would I use this when I can just use goquery?

~~~
WaltPurvis
Because goquery is _only_ a DOM parser/manipulator and not a full-fledged (or
even half-fledged) "scraping framework"?
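
Roughly speaking, Colly layers request handling, deduplication, and callbacks
on top of the parsed DOM. A minimal crawler sketch (domain and selector are
placeholders) looks something like:

    package main
    
    import (
        "fmt"
    
        "github.com/gocolly/colly"
    )
    
    func main() {
        c := colly.NewCollector()
        c.AllowedDomains = []string{"example.com"}
    
        // Fired for every element matching the selector; the HTMLElement is
        // backed by a goquery selection under the hood.
        c.OnHTML("a[href]", func(e *colly.HTMLElement) {
            link := e.Attr("href")
            fmt.Println("found:", link)
            c.Visit(e.Request.AbsoluteURL(link))
        })
    
        c.Visit("https://example.com/")
    }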

------
Xeoncross
Please break up your main `colly.go` file into separate parts. If possible you
shouldn't have a 30-line import block covering everything from cookies and
regex to HTML and sync access.

Make sure DNS caching is in place on the box, or else add it in Go.

Colly only supports a single machine via a map of visited URLs. It would be
great if you replaced it with a queue like Redis or beanstalkd:

    
    
        visitedURLs map[uint64]bool
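
A hypothetical sketch of what a pluggable visited-URL store could look like
(the interface and names below are made up for illustration, not Colly's API);
a Redis- or beanstalkd-backed implementation of the same interface would let
several machines share crawl state:

    package scraper
    
    import "sync"
    
    // VisitedStore abstracts over the set of already-visited URL hashes.
    type VisitedStore interface {
        MarkVisited(urlHash uint64) error
        IsVisited(urlHash uint64) (bool, error)
    }
    
    // memoryStore mirrors the current in-process map; a redisStore
    // implementing the same interface could replace it for distributed runs.
    type memoryStore struct {
        mu      sync.Mutex
        visited map[uint64]bool
    }
    
    func newMemoryStore() *memoryStore {
        return &memoryStore{visited: make(map[uint64]bool)}
    }
    
    func (s *memoryStore) MarkVisited(h uint64) error {
        s.mu.Lock()
        defer s.mu.Unlock()
        s.visited[h] = true
        return nil
    }
    
    func (s *memoryStore) IsVisited(h uint64) (bool, error) {
        s.mu.Lock()
        defer s.mu.Unlock()
        return s.visited[h], nil
    }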

~~~
fiatjaf
Please don't follow this suggestion. It's perfectly healthy to keep everything
in a single file if you find that manageable, so there's no problem at all.

~~~
JepZ
Well, in Go all files in one directory belong to the same package (as a best
practice), and files within the same package do not have to import each other
to access each other's functions. Therefore, breaking a package into several
files is common practice.

A sane approach is, for example, to create a separate file for each type
(Collector, HTMLElement, Request, Response, ...) and its attached
functions/methods, as sketched below.
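
As a rough illustration (the file name and fields are hypothetical), a
request.go placed next to collector.go could reference the Collector type
directly:

    // request.go -- lives in the same package as collector.go, so it can use
    // the Collector type and its methods without importing anything.
    package colly
    
    // Request carries per-request state and points back at its Collector.
    type Request struct {
        URL       string
        collector *Collector
    }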

------
kondro
How does this go with running the JS on the SPAs that make up a large portion
of the web today?

