

Scrape: A simple, higher level interface for Go web scraping - ericchiang
https://github.com/yhat/scrape

======
peteretep
Go has some weird syntactic sugar, including a case where a method invocation is
rewritten by the compiler to pass in a value or a pointer depending on what
the _callee_ wants(!?!). And yet Go code is still littered with:

    
    
        if err != nil {
    

... rather than some simple, compile-time validated sugar to pass the error
value up the call chain. Yes, I've read the justification documents. No, they
still don't make a very convincing argument.

~~~
fishnchips
> rewritten by the compiler to pass in a value or a pointer

This applies only to method receivers, not to arguments. Essentially, invoking
a method on a pointer receiver (like '->' in C/C++) looks no different from
invoking it on a value receiver (like '.' in C/C++). But you can't pass a
'pointer to int' to a function where an 'int' is expected.
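
A minimal sketch of that receiver rule. The `Counter` type and the `double` function are hypothetical names made up for illustration; they are not from any library.

```go
package main

import "fmt"

// Counter is a hypothetical type used only for illustration.
type Counter struct{ n int }

// Inc has a pointer receiver, so it can modify the Counter it is called on.
func (c *Counter) Inc() { c.n++ }

// Value has a value receiver and works on a copy of the Counter.
func (c Counter) Value() int { return c.n }

// double takes a plain int; arguments get no automatic conversion.
func double(x int) int { return 2 * x }

func main() {
	c := Counter{}
	c.Inc() // c is addressable, so this is shorthand for (&c).Inc()

	p := &c
	p.Inc()                // pointer receiver called through a pointer
	fmt.Println(p.Value()) // shorthand for (*p).Value()

	n := 21
	ptr := &n
	// double(ptr) would not compile: a *int is not accepted where int is expected.
	fmt.Println(double(*ptr)) // arguments must be dereferenced explicitly
}
```

Run as written, this should print 2 and then 42. The shorthand only kicks in for the receiver, and only when the value is addressable.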

> some simple, compile-time validated sugar to pass the error value up the
> call chain

This bothers me a bit, too. Sometimes I was able to work around it in a
creative way. Take http handlers as an example:

    
    
      func standardHandler(w http.ResponseWriter, r *http.Request) {
        // errors may occur here...
      }
    

You can wrap these in an error handling function like this:

    
    
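      // naiveHandler is like http.HandlerFunc, but may return an error.
      type naiveHandler func(w http.ResponseWriter, r *http.Request) error
    
      func handleErrors(fn naiveHandler) http.HandlerFunc {
        return func(w http.ResponseWriter, r *http.Request) {
          if err := fn(w, r); err != nil {
            // handle errors
          }
        }
      }
    
      // Each handler needs its own pattern; registering "/" twice panics.
      http.HandleFunc("/one", handleErrors(handler1))
      http.HandleFunc("/two", handleErrors(handler2))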
      // ...
    

This is handy since you can handle different errors differently (e.g. report,
log, etc.) but uniformly across all handlers. To be fair, the only syntactic
sugar I'd need in most cases is an equivalent of a C/C++ macro:

    
    
      RETURN_ERR_UNLESS_NIL(/* Go expression returning only error */)

------
rdudekul
To me goquery seems more intuitive than scrape, maybe because I am more
familiar with jQuery selector syntax.

Is there any reason why the yhat guys (ericchiang) created Scrape rather than
using, say, goquery?

Could you make the matcher function in main.go go away with a simpler (more
intuitive) interface/API/DSL?

~~~
ericchiang
This is a very small amount of boilerplate around the golang.org/x/net/html
package. If you need the huge feature set of goquery, use that. But I find
this pretty suitable for my day-to-day problems.

------
jwcrux
I like goquery[1] for doing this type of thing.

[1]
[https://github.com/PuerkitoBio/goquery](https://github.com/PuerkitoBio/goquery)

------
thinxer
I'd like to introduce htmlutil[1] and cascadia[2] for DOM processing in Go,
which is useful when scraping articles.

[1]:
[https://github.com/thinxer/go-htmlutil](https://github.com/thinxer/go-htmlutil)

[2]:
[https://github.com/andybalholm/cascadia](https://github.com/andybalholm/cascadia)

------
headzoo
Selfless plug... You may also want to check out Surf for web scraping.

[https://github.com/headzoo/surf](https://github.com/headzoo/surf) Docs:
[http://www.gosurf.io/](http://www.gosurf.io/)

Among other things, goquery is baked in, so you can easily select page
elements using CSS selectors.

------
chrissnell
This is very cool. I'm not much of a front-end guy, so I'm struggling with the
examples. Would you mind posting a simple example that will scrape, say, the
first TD tag of every row of a table? Thanks.

~~~
ericchiang

      rows := scrape.FindAll(table, scrape.ByTag(atom.Tr))
      cols := []*html.Node{}
      for _, row := range rows {
          // Find returns the first result
          col, ok := scrape.Find(row, scrape.ByTag(atom.Td))
          if ok {
              cols = append(cols, col)
          }
      }

~~~
chrissnell
Thanks!

------
lunixbochs
Nice! See also
[https://github.com/andrew-d/goscrape](https://github.com/andrew-d/goscrape)

------
bjblazkowicz
Does it support XPath?

~~~
peteretep
Those who don't understand xpath are cursed to reinvent it, poorly.

