

Hacker News Robots.txt Failure :( - gmaster1440

So here I am, sitting in front of my MacBook Pro, wondering why the Hacker News app I am developing began to crash. It turns out that Yahoo's YQL service began respecting robots.txt and, as a result, is obeying HN's request not to allow scraping. With this, I beg you, HN, to allow at least minimal front-page scraping for services like YQL, so that my app, which I hope you will be able to use in the near future, can exist.

Thank you.

EXAMPLE: http://developer.yahoo.com/yql/console/?q=select%20*%20from%20html%20where%20url%3D%22http://news.ycombinator.com%22%20and%20xpath%3D//table//table//td[%40class%3D%22title%22]/a#h=select%20*%20from%20html%20where%20url%3D%22http%3A//news.ycombinator.com%22%20and%20xpath%3D%27//table//table//td%5B@class%3D%22title%22%5D/a%27
======
_delirium
What are you trying to scrape? This is HN's robots.txt:

    
    
      User-Agent: * 
      Disallow: /x?
      Disallow: /vote?
      Disallow: /reply?
      Disallow: /submitted?
      Disallow: /threads?
    

That doesn't look like it blocks either the front page, the /newest page, or
article pages (which have the form /item?id=N).
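
One way to double-check that reading is to feed the rules above to Python's standard `urllib.robotparser` and ask it about a few paths. This is only a rough sketch: the agent name `ExampleBot` and the sample query strings are made up for illustration, and YQL's own robots.txt handling may well differ from this parser's.

```python
from urllib.robotparser import RobotFileParser

# The rules quoted above, copied as plain text.
ROBOTS_TXT = """\
User-Agent: *
Disallow: /x?
Disallow: /vote?
Disallow: /reply?
Disallow: /submitted?
Disallow: /threads?
"""

BASE_URL = "http://news.ycombinator.com"


def build_parser(robots_txt):
    """Parse robots.txt text without making any network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser, path, agent="ExampleBot"):
    """Return True if `agent` (a hypothetical name) may fetch `path`."""
    return parser.can_fetch(agent, BASE_URL + path)


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    # Sample paths; the query strings are illustrative only.
    for path in ["/", "/newest", "/item?id=1", "/vote?for=1", "/threads?id=pg"]:
        verdict = "allowed" if is_allowed(parser, path) else "blocked"
        print(path, verdict)
```

If this parser reports the front page as allowed while YQL still refuses, the difference is more likely on YQL's side (or in a different robots.txt being served) than in the rules quoted here.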

~~~
gmaster1440
Try:
[http://developer.yahoo.com/yql/console/?q=select%20*%20from%...](http://developer.yahoo.com/yql/console/?q=select%20*%20from%20html%20where%20url%3D%22http://news.ycombinator.com%22%20and%20xpath%3D//table//table//td\[%40class%3D%22title%22\]/a#h=select%20*%20from%20html%20where%20url%3D%22http%3A//news.ycombinator.com%22%20and%20xpath%3D%27//table//table//td%5B@class%3D%22title%22%5D/a%27)

~~~
_delirium
Works for me. I clicked "test" and got what looks like a list of HN's
front-page links in the results.

~~~
gmaster1440
Very weird, this is what I get:
[http://img444.imageshack.us/img444/290/screenshot20100413at8...](http://img444.imageshack.us/img444/290/screenshot20100413at813.png)

------
ulvund
I think you should try sending pg an email

~~~
gmaster1440
what's pg's email?

