

Using YQL to grab HN links - chaosmachine
http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20html%20where%20url%3D%22http%3A%2F%2Fnews.ycombinator.com%2F%22%20and%0Axpath%3D'%2F%2Ftr%2Ftd%2Fa%5Bsubstring(%40href%2C1%2C4)%3D%22http%22%5D%5B%40href!%3D%22http%3A%2F%2Fycombinator.com%22%5D'%0A&format=xml

======
chaosmachine
The query is:

    
      select * from html where url="http://news.ycombinator.com/" and
      xpath='//tr/td/a[substring(@href,1,4)="http"][@href!="http://ycombinator.com"]'
    

You can play with it yourself here (needs a Yahoo login):

<http://developer.yahoo.com/yql/console/>
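For reference, the request URL at the top of the post can be assembled from this query; here's a minimal Python 3 sketch (the public YQL endpoint has since been retired by Yahoo, so this only demonstrates the URL encoding, not a live call):

```python
# Build the YQL REST request URL for the query above.
# Note: the public YQL endpoint no longer exists; this only shows encoding.
from urllib.parse import urlencode

YQL_ENDPOINT = "http://query.yahooapis.com/v1/public/yql"

query = (
    'select * from html where url="http://news.ycombinator.com/" and '
    "xpath='//tr/td/a[substring(@href,1,4)=\"http\"]"
    "[@href!=\"http://ycombinator.com\"]'"
)

# urlencode percent-escapes the query string for the q parameter
request_url = YQL_ENDPOINT + "?" + urlencode({"q": query, "format": "xml"})
print(request_url)
```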

~~~
henning
I don't get it. Don't queries like that suffer from the same problems as
normal screen scraping?

~~~
jfarmer
Yeah, they're obviously brittle, but it's baby steps, y'know?

I think it's pretty crazy that you can now scrape well-marked pages with a
SQL-like syntax.

------
rjurney
Just spent 20 minutes trying to grab all article links from a newspaper's
website based on a url pattern. Failed. The docs could use some work.

Anyone know how to do this?

~~~
tectonic
For another approach, with a structure editor, check out
<http://parselets.com>

------
3ds
In Python it would be like this (get BeautifulSoup first):

    
      import urllib2
      from BeautifulSoup import BeautifulSoup
    
      ychtml = urllib2.urlopen('http://news.ycombinator.com/').read()
      for tdtitle in BeautifulSoup(ychtml).findAll("td", "title"):
        print tdtitle.a
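That snippet needs BeautifulSoup installed (and is Python 2). The same idea, finding anchors inside `<td class="title">` cells, can be sketched with only the Python 3 standard library; the sample HTML below is a made-up stand-in for the live page:

```python
# Stdlib-only sketch: collect hrefs from <a> tags inside <td class="title">.
from html.parser import HTMLParser

class TitleLinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title_td = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "td" and attrs.get("class") == "title":
            self.in_title_td = True
        elif tag == "a" and self.in_title_td and "href" in attrs:
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_title_td = False

# Hypothetical sample markup in the shape HN used at the time
sample = ('<tr><td class="title"><a href="http://example.com/story">Story</a>'
          '</td><td class="subtext"><a href="item?id=1">comments</a></td></tr>')

parser = TitleLinkParser()
parser.feed(sample)
print(parser.links)  # only the article link, not the comments link
```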

------
yeahit
If you use Linux or some other Unix, you can also do it with standard Unix
tools:

    
      wget -O- news.ycombinator.com | grep -o 'http[^"]*'
    

Personally, I prefer curl because it writes to stdout by default:

    
      curl news.ycombinator.com | grep -o 'http[^"]*'
    

(After posting this, I noticed that HN cuts off * signs at the end of a
message, so I had to add this text, or the last * would not be displayed.)

~~~
sarp
This shows all URLs on HN, including images etc., whereas the original post
demonstrates retrieving only the linked articles.

~~~
yeahit
You can filter that with another grep, for example:

    
      wget -O- news.ycombinator.com | grep -o 'title"><a href="[^"]*' | grep -o 'http.*'
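The two-pass grep trick can be mimicked in Python with two regex passes: first pull the href out of title cells, then keep only absolute http links. The sample HTML here is a made-up stand-in for the live page:

```python
# Two-pass filtering, mirroring the shell pipeline above.
import re

# Hypothetical snippet in the shape of the HN front page markup
html = ('<td class="title"><a href="http://example.com/story">Story</a></td>'
        '<td class="subtext"><a href="item?id=1">42 comments</a></td>'
        '<img src="http://ycombinator.com/images/grayarrow.gif">')

# grep -o 'title"><a href="[^"]*'  ->  capture the href inside title cells
cells = re.findall(r'title"><a href="([^"]*)', html)

# grep -o 'http.*'  ->  keep only absolute http links
urls = [u for u in cells if u.startswith("http")]
print(urls)
```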

------
ghempton
Does it first convert the page to valid XHTML and then apply the XPath, or
does it rely on the website being well-formed?

~~~
simonw
It works against not-well-formed markup.

------
found_dead
I actually attended the talk on YQL at Barcamp Portland and it seems really
powerful.

You are basically able to use Yahoo's cloud servers and huge internet pipe for
free with this service.

