
That's pretty disappointing. I'd love to hear an official explanation of why this choice was made.



I quite agree. I use Google to search HN. I just used it the other day, because I didn't remember this one guy's startup or username, but I knew it involved family websites. I was able to find it after a bit of search-fu. I am sad to see that HN won't be indexed by Google anymore. :(


Yes, given the way references vanish from HN, I really need Google to find things a couple of days old. I have tried other search methods and found them not worthwhile.

Grumble...


One way to search news.ycombinator is to visit searchyc.com


Unless they're obeying robots.txt, which they probably should be
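For reference, a well-behaved crawler consults robots.txt before fetching anything. A minimal Python sketch of that check, assuming HN's file disallows everything under "User-agent: *":

    # Sketch: how a polite crawler checks robots.txt before crawling.
    # The blanket Disallow rule is an assumption about HN's actual file.
    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("http://news.ycombinator.com/robots.txt")
    rp.read()  # fetch and parse the file

    # Returns False if "User-agent: *" / "Disallow: /" is in effect
    print(rp.can_fetch("Googlebot", "http://news.ycombinator.com/item?id=1"))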


I tried searching "google go" with and without quotes and the results were the same. Also, very few of the results actually pertained to Google Go. I would still rather have Google...


Hmm, does search-fu qualify as something you can put on a resume?


I wonder if part of the reason is that the Arc web server doesn't send HTTP status codes properly. Every URL path, valid or not, returns a 200, even if the content is only "Unknown."

For example, check the headers sent back for:

http://news.ycombinator.com/no/way/this/is/possibly/valid

    HTTP/1.1 200 OK
    Content-Type: text/html; charset=utf-8
    Connection: close

If a search engine sees that, it will periodically revisit it forever.
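This is easy to reproduce; a minimal Python sketch that fetches the bogus URL above and prints what comes back:

    # Sketch: confirm that an invalid path still answers 200 OK.
    # Standard library only; the URL is the example from above.
    import urllib.request

    url = "http://news.ycombinator.com/no/way/this/is/possibly/valid"
    with urllib.request.urlopen(url) as resp:
        print(resp.status)                       # 200, despite the bogus path
        print(resp.headers.get("Content-Type"))  # text/html; charset=utf-8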


The correct solution is to fix the Arc server to return correct HTTP headers, not block everything using robots.txt.
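For comparison, a minimal sketch of the intended behavior, written as a Python stand-in for the Arc server (the route table is hypothetical):

    # Sketch of the fix: answer 404 for unknown paths instead of a
    # catch-all 200 "Unknown." page. Python stand-in, not Arc.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    KNOWN_PATHS = {"/", "/news", "/newest"}  # hypothetical route table

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path in KNOWN_PATHS:
                self.send_response(200)
                self.send_header("Content-Type", "text/html; charset=utf-8")
                self.end_headers()
                self.wfile.write(b"<html>...</html>")
            else:
                self.send_response(404)  # crawlers eventually drop the URL
                self.end_headers()

    HTTPServer(("", 8080), Handler).serve_forever()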


That may be the correct solution, but I'm sure the current option was chosen because it was the quick one.


Maybe they're blocking everything while they're in the process of fixing it.


Until someone creates an HTML link to such a document, how does this problem manifest itself?


The site itself created zillions of them with its random, expiring fnids. Although the scheme has changed now, crawlers would still hit those obsolete URLs.


typos?


This sucks for one major reason. Sometimes I want to dig up a link that was on HN, and I don't know where else to find it. So I search google with site:ycombinator.com. If they block Google, then that won't work.

At the very least, add a search engine to HN itself so I can search for old links.


http://searchyc.com/ (assuming it's not blocked also)


Agreed 100%

Why produce a great site and then not let your users access its valuable content?

Things scroll off the front page so fast that it's not really fair not to provide a search feature.

I don't care if they let external search engines (Google et al) index it, but they should at least provide /some/ kind of internal search feature themselves.

IMO providing a search feature is probably at least as effective as the "noprocrast" flag, if not more so.

After all, if I know I can search past items at any point in the future, I don't feel a pressure to load up the front page several times a day.

Whereas if the fast-scrolling front page is the only way to access items, it encourages that sort of rat-hitting-the-bar behavior I learned about in college psych classes ;)


HN is very, very slow. Removing crawlers likely helps with that.


Removing a leg or two could also help one towards a weight loss goal.


But then you lose a lot of options for exercise. Walking and running are definitely out.


I would bet good money that it's nothing that a good dedicated server with, say, 32 GiB of RAM and 2 or 4 SAS spindles couldn't fix.


You can set Googlebot's crawl rate with Google Webmaster Tools if that's the issue.


The robots.txt file's Content-Type is text/html, not text/plain, so it won't work. I'm still able to access the contents of this site from Yahoo Pipes. Don't worry about it.
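That claim is easy to check; a one-off Python sketch that prints the header in question:

    # Sketch: print the Content-Type the server sends for robots.txt
    # (the robots exclusion convention expects text/plain).
    import urllib.request

    with urllib.request.urlopen("http://news.ycombinator.com/robots.txt") as resp:
        print(resp.headers.get("Content-Type"))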



