
Colly: Fast and Elegant Scraping Framework for Golang - asciimoo
https://github.com/asciimoo/colly?source=hackernews
======
integrii
You will get the initial source code with this, but if you want JavaScript to
work, you just use PhantomJS. If you want PhantomJS to be usable, you use
CasperJS... until you find a site with FuzeJS or some other JavaScript- or
HTML-intensive site. Those won't render in PhantomJS.

For that stuff, as of a few months ago, you can use headless Chrome. I wrote a
couple of Go packages to make that easy. It basically runs headless Chrome
with a JavaScript REPL console you can use to interact with the session.
[https://GitHub.com/integrii/headlessChrome](https://GitHub.com/integrii/headlessChrome)

I was also able to smash my whole scraper bot into a Docker container after
working around a couple of bugs.

~~~
stcredzero
_It basically runs headless chrome with a JavaScript REPL console you can use
to interact with the session._

That looks cool! Would I be able to run Node scripts?

~~~
simlevesque
> Would I be able to run Node scripts?

No. You can't run Node scripts in Chrome, and the same is true for headless
Chrome.

------
afandian
One thing I'm interested in is scraping those annoying sites that require
JavaScript execution. More and more webpages require JS even to display
anything beyond a blank page. These sites self-select for exclusion from
scraping like this.

I've looked into Headless Chrome, but I'd be interested to see a 'scraping
framework' level abstraction for those sites.
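
For what it's worth, driving headless Chrome directly from Go is doable; below
is a rough sketch using the chromedp package (nothing to do with Colly, and
the URL and selector are placeholders), though it's still a long way from a
scraping-framework-level abstraction:

    package main

    import (
        "context"
        "fmt"
        "log"
        "time"

        "github.com/chromedp/chromedp"
    )

    func main() {
        // Start a headless Chrome session driven over the DevTools protocol.
        ctx, cancel := chromedp.NewContext(context.Background())
        defer cancel()
        ctx, cancel = context.WithTimeout(ctx, 30*time.Second)
        defer cancel()

        var rendered string
        err := chromedp.Run(ctx,
            // Placeholder URL and selector; swap in the JS-heavy page you care about.
            chromedp.Navigate("https://example.com"),
            chromedp.WaitVisible("body", chromedp.ByQuery),
            chromedp.OuterHTML("html", &rendered),
        )
        if err != nil {
            log.Fatal(err)
        }
        // rendered holds the post-JavaScript DOM, ready for any HTML parser.
        fmt.Println(len(rendered), "bytes of rendered HTML")
    }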

~~~
feelin_googley
"More and more webpages are requiring JS even to display anything beyond a
blank page."

Can you provide some example webpages so we can take a look?

Also, I agree with @asciimoo's point about endpoints. One could make an
argument that, compared to the website design of the 1990s and 2000s,
retrieving structured data from websites is actually getting easier, not more
difficult. I recall a period when the trend was to design websites entirely in
Macromedia/Adobe Flash.

Here is what is missing from this project (and many others like it): when
providing software that performs text processing, one needs to provide not
only example code but also _example output_. This lets a user quickly compare
her current text-processing solution with the software being offered without
having to install, review and run unfamiliar software.

Without some sample output, for example, she cannot tell whether her current
solution produces the same output faster or with less code.

~~~
afandian
With a minute's searching, here's a webpage that's meaningless without
JavaScript:

[https://blogger.googleblog.com/](https://blogger.googleblog.com/)

It's arguably a page of text, not a web app that needs user interaction.

I'm not trying to start an argument about JS: the consensus on HN seems to be
that if you don't execute JavaScript you don't deserve to read webpages. I'm
just saying that your website's clients are more diverse than 'normal' well-
sighted humans. There may be machines reading the site, for all kinds of
reasons.

And regarding endpoints. One could make the argument that with AJAX we now
have richer APIs. I disagree. We have a well-understood API for getting
hypertext (HTTP) in a well-understood format (HTML) that works/worked across
all websites. Replacing that with a custom-built API for every website isn't
apples-to-apples.

~~~
feelin_googley
The reason I asked is that I rarely use a graphical browser, so I am always
curious about websites that are inaccessible without JavaScript.

Blogger has a feed. Here is one way to retrieve it, in two steps.

    
    
         1. get TargetBlogID
         2. retrieve data
         optional:
         3. format data for viewing
    
         Example:
    
         1.
         x=$(exec curl https://blogger.googleblog.com \
         |exec sed '
         s/\\046/\&/g;
         s/\\46/\&/g;
         s/\\075/=/g;
         s/\\75/=/g;
         /targetBlogID/!d;
         s/.*targetBlogID=//;
         s/&.*//;
         '); 
    
         2.
         curl -o y.htm https://www.blogger.com/feeds/$x/posts/default 
    
         3. 
         exec sed '
         # ^M is "\r"
         s/^[0-9a-f]*^M//; 
         s/&lt;/</g;
         s/&gt;/>/g;
         s/&amp;/\&/g;
         s/&quot;/\"/g;
         1i\
         <br><br>
         
         s/<name>/<br><br>name &/g;
         s/<uri>/<br>uri &/g;
         s/<generator>/<br>generator &/g;
         s/Blogger//;
         s/<id>/<br>id &/g;
         s/<published>/<br>published &/g;
         s/<email>/<br>email &/g;
         s/<title type=.text.>/<br><br>&/g;
         s/<openSearch:totalResults>/<br>total results &/g;
         s/<openSearch:startIndex>/<br>start index &/g;
         s/<openSearch:itemsPerPage>/<br>items per page &/g;
         s/<updated>/<br>updated &/g;
         s/<thr:total>/<br>thr:total &/g;
         s/<\/feed>/&<br><br><br>/;
         s/^M*/<br>/;
         ' y.htm \
         |exec tr -cd '\12\40-\176'

~~~
always_good
It's almost like you asked that question to bait someone into giving you an
excuse to show "just how easy it is" with some cli hacking.

------
drej
Borrowing an argument from an article talking the speed of Python: focus on
what's your bottleneck. If you're worried about the performance of your tool,
make sure it's not actually waiting for something else (IO, network,
scheduling, ...).

Here, unless you're parsing a large number of already-downloaded files (a
website dump, [re]parsing of a long-standing archive, etc.), you're not going
to get a huge benefit from using a fast parser, because the network is going
to be the limiting factor.

Keep that in mind.

~~~
Walkman
I totally agree! I like Go, but this is not a field I would ever use it in.
Parsing the sites will never be the bottleneck; following HTML changes or
making it work for multiple sites is. When it can take seconds to download a
page, a couple of milliseconds of performance advantage from Go doesn't matter
at all.

------
Xeoncross
Remember to set up DNS caching on the box or use something like
[https://github.com/viki-org/dnscache](https://github.com/viki-org/dnscache).

Also, there doesn't seem to be any check when reading the response body. You
want to limit the read length.
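
If the library doesn't cap that itself (I haven't checked), a minimal sketch
of limiting the body read with just the standard library; the 10 MB cap is an
arbitrary number:

    package main

    import (
        "fmt"
        "io"
        "log"
        "net/http"
    )

    // maxBodyBytes is an arbitrary 10 MB cap on how much of a response we read.
    const maxBodyBytes = 10 << 20

    func fetchLimited(url string) ([]byte, error) {
        resp, err := http.Get(url)
        if err != nil {
            return nil, err
        }
        defer resp.Body.Close()

        // io.LimitReader stops after maxBodyBytes, so a misbehaving server
        // can't make the scraper buffer an unbounded response.
        return io.ReadAll(io.LimitReader(resp.Body, maxBodyBytes))
    }

    func main() {
        body, err := fetchLimited("https://example.com")
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println(len(body), "bytes read")
    }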

------
gjem97
The HTML parsing part appears to be "golang.org/x/net/html". Does anyone have
experience parsing "real world" html with this? How does it do?

~~~
jerf
It's an HTML5-compliant parser. That means that, modulo bugs, it should
produce the same results as any other modern HTML parser, which should also
all be based on HTML5.

For context, since I think a lot of people are still unaware of this, the
HTML5 standard precisely specifies how HTML should be parsed:
[https://www.w3.org/TR/html5/syntax.html#parsing](https://www.w3.org/TR/html5/syntax.html#parsing)
This is based on a survey of how the various browsers were handling it in
reality, so it's not just one of those theoretical things that everybody
ignores; it's an algorithm extracted from the brutal pragmatism of many
separate code bases over many years. In theory, all HTML parsing libraries
should now be able to take the same input and produce the same DOM nodes. In
practice I've not used a variety of such libraries, nor have I fed them much
pathological input, so I can't vouch for whether this is 100% true, but in
theory there should no longer be any significant differences between HTML
parsers in various languages as they come on board with HTML5 compliance.
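
To make that concrete, here's a small sketch that feeds deliberately sloppy
markup to golang.org/x/net/html and walks the tree it builds; the HTML5
algorithm produces a well-defined document even though the input never closes
its tags:

    package main

    import (
        "fmt"
        "log"
        "strings"

        "golang.org/x/net/html"
    )

    func main() {
        // Sloppy input: no <html>, <head> or <body>, and nothing is closed.
        doc, err := html.Parse(strings.NewReader(`<p>hello <a href=/x>link`))
        if err != nil {
            log.Fatal(err)
        }

        // Print every element node, indented by depth, to show the
        // normalized tree: html > head, body > p > a.
        var walk func(n *html.Node, depth int)
        walk = func(n *html.Node, depth int) {
            if n.Type == html.ElementNode {
                fmt.Printf("%s<%s>\n", strings.Repeat("  ", depth), n.Data)
            }
            for c := n.FirstChild; c != nil; c = c.NextSibling {
                walk(c, depth+1)
            }
        }
        walk(doc, 0)
    }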

------
stesch
Why are scrapers and scraping so popular? What is a real use case for it?

~~~
stephengillie
Not seeing ads.

Secondarily, there is a lot of data on the internet stored only in HTML pages.
For data with multiple sources, HTML is still usually a common format. HTML
just has more punctuation and errata to filter out than JSON, XML, or CSV.

~~~
stesch
So you want the data but you don't want to ask for it?

~~~
stephengillie
If I come across data I want/need, why should I have to highlight, copy, and
read from my clipboard? It's much more elegant to have automation do this.

------
maxpert
OnHTML callbacks? Not a big fan of callbacks when you have channels.

~~~
weberc2
Channels require concurrency; you have to spin up another goroutine and take
care not to leak or deadlock. Channels are for communication between
goroutines, not for general abstraction.
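
A trivial illustration of the point: a send on an unbuffered channel with
nobody receiving simply blocks, so a channel-based API would push that
coordination onto every caller.

    package main

    func main() {
        results := make(chan string)

        // No goroutine is receiving from results, so this send blocks forever
        // and the runtime aborts with "all goroutines are asleep - deadlock!".
        // A callback-based API spares callers this bookkeeping.
        results <- "parsed element"
    }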

~~~
hnlmorg
I agree. Callbacks make sense in this context and are the idiomatic way to
write the code, given that's the approach used in the standard library, e.g.:

[https://golang.org/pkg/path/filepath/#Walk](https://golang.org/pkg/path/filepath/#Walk)

[https://golang.org/pkg/net/http/#HandleFunc](https://golang.org/pkg/net/http/#HandleFunc)
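
For reference, the callback style in question looks roughly like this with
Colly (import path as in the linked repo; the selector and URL are
placeholders, and details may have changed since):

    package main

    import (
        "fmt"
        "log"

        "github.com/asciimoo/colly"
    )

    func main() {
        c := colly.NewCollector()

        // OnHTML registers a callback for every element matching the selector,
        // much as filepath.Walk and http.HandleFunc register functions.
        c.OnHTML("a[href]", func(e *colly.HTMLElement) {
            fmt.Println("found link:", e.Attr("href"))
        })

        if err := c.Visit("https://example.com"); err != nil {
            log.Fatal(err)
        }
    }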

------
tree_of_item
Is there any reason to use this instead of Puppeteer? It feels like Puppeteer
is going to dominate this space unless another browser vendor makes their own
framework.

