Don't worry, it doesn't mean anything. The software for ranking applications runs on the same server, and it is horribly inefficient (something 4 people use every 6 months doesn't tend to get optimized much). This weekend all of us were reading applications at the same time, and the system was getting so slow that I banned crawlers for a bit to buy us some margin. (Traffic from crawlers is much more expensive for us than traffic from human users, because it interacts badly with lazy item loading.) We only finished reading applications an hour before I had to leave for SXSW, so I forgot to set robots.txt back to the normal one, but I just did now.
Also, they use Hacker News to communicate with applicants. I applied a while back, and one day while browsing I got a message through HN, something like "pg has a question for you" in the header bar. Clicking through brought up the question and a place to put the answer.
Basically, there's a whole messaging system in hacker news that most people don't see.
Have you considered using Last-Modified/ETag headers to control the crawlers? If you detect a crawler requesting a page it has visited before, you could just return a 304 without doing anything else.
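A minimal sketch of what I mean, assuming the server tracks a last-modified time per page (the handler, page store, and port below are all invented for illustration; HN itself is written in Arc, not Python):

    # Conditional GET: if the crawler sends If-Modified-Since and the
    # page hasn't changed, answer 304 with no body and skip the
    # expensive item loading entirely.
    from http.server import BaseHTTPRequestHandler, HTTPServer
    from email.utils import formatdate, parsedate_to_datetime

    # hypothetical store: path -> (rendered body, last-modified timestamp)
    PAGES = {"/item?id=1": ("<html>...</html>", 1234567890.0)}

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path not in PAGES:
                self.send_error(404)
                return
            body, mtime = PAGES[self.path]
            ims = self.headers.get("If-Modified-Since")
            if ims:
                try:
                    if parsedate_to_datetime(ims).timestamp() >= mtime:
                        self.send_response(304)  # "you already have it"
                        self.end_headers()
                        return
                except (TypeError, ValueError):
                    pass  # unparseable date: fall through to a full response
            self.send_response(200)
            self.send_header("Last-Modified", formatdate(mtime, usegmt=True))
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            self.wfile.write(body.encode())

    HTTPServer(("", 8080), Handler).serve_forever()

Well-behaved crawlers (Google's included) do send If-Modified-Since, so this would cut most of their traffic down to header exchanges.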
robots.txt is really only like a note on the front door of your unlocked house that says "please only look at these items in these rooms" or "please do not look in my house." If someone wants to write a program to index the site, they can still do so. Google, MSN, Bing, Yahoo, et al. will respect the robots.txt file, though.
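For concreteness, the "note on the door" is just a plain-text file like this (the paths are made up for the example):

    # "please only look at these items in these rooms"
    User-agent: *
    Disallow: /drafts/
    Disallow: /admin/

    # or, the "please do not look in my house" version:
    # User-agent: *
    # Disallow: /

Nothing enforces it; a polite crawler fetches it first and obeys, an impolite one never even looks.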
This sucks for one major reason. Sometimes I want to dig up a link that was on HN, and I don't know where else to find it. So I search Google with site:ycombinator.com. If they block Google, then that won't work.
At the very least, add a search engine to HN itself so I can search for old links.
I quite agree. I use Google to search HN. I just used it the other day, because I didn't remember this one guy's startup or username, but I knew it involved family websites. I was able to find it after a bit of search-fu. I am sad to see that HN won't be indexed by Google anymore. :(
My vote is to constrain growth as much as possible, at least that which comes from stupid sources. Smart hackers will find this site just fine without Yahoo or MSN, probably even Google.
As "evil" as blocking sites and crawlers may sound, I think these types of measures will be necessary to preserve the quality of content here. Whatever actions further that objective have my vote.
Will this break Google Reader and other RSS readers that legitimately use the RSS feed here? After all, Google Reader uses a bot to read the feed and lets us search it from within their app, much like their search engine does with regular pages :-) (That is, "web robots" aren't just spiders.)
I do exactly the same thing at least once a day. I can always remember just a snippet of the post or subject I want to find and that type of query always gets me what I want. I just used it today to find the YC company who is doing the journalism stuff so I could send the link to my nephew.
The 'official' HNSearch is completely useless for anyone who would ever want to search HN. It's a Firefox Extension (!) that adds a 400px sidebar to your google search results pages showing site: searches along with an extra adsense ad. It's less useless than the 5 IE toolbars your annoying relatives live with, but just as obnoxious.
I just installed it again for a laugh, and it actually redirects to searchyc.com results when you click on its "Hacker News" header, even though pg staunchly refuses to ever link to searchyc. I made fun of him for it at his party before Startup School and he tried to laugh it off with "they're a YC company!", as if that weren't obvious on the face of it.
He invited alaskamiller to that private party too; I was kicking myself later for not putting a little more fuel on that fire by pulling him into that conversation (not that I didn't piss off pg enough).
There's a disenfranchisement heuristic that causes your votes to be recorded (can't vote again) but not counted. It usually persists for about a day. I think it's supposed to kill voting rings but its behavior is pretty odd.
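Nobody outside YC has seen the actual code, so this is pure guesswork, but the observed behavior is consistent with something as simple as this (every name here is invented):

    # Vote is recorded (the arrow disappears, you can't vote again)
    # but only counted toward the score if the voter isn't flagged.
    def apply_vote(story, voter, votes_seen, suspected_ring):
        if voter.id in votes_seen[story.id]:
            return                            # already voted
        votes_seen[story.id].add(voter.id)    # recorded
        if voter.id not in suspected_ring:
            story.score += 1                  # counted

The point of not counting silently, rather than rejecting, is that the ring can't tell its votes are being discarded.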
The number in your browser gets incremented or decremented client-side, independently of reality: the vote arrow fires a blind GET request by pre-fetching an 'image', rather than making an XHR POST and reading the new karma value from the response. Like a lot of things in Arc, it's just poor HTTP.
If this is an anti-spam measure, I expect it will be about as effective as "no follow" was. There's still plenty of good reasons for spammers to submit crap. RSS and Twitter syndication of links, for example.
On the other hand, if the goal is to push HN back into semi-obscurity by making it harder to find, it might work.
Does anyone know why HN decided to disallow search engines? It's certainly up to the website owner to decide, but there are good ways to (say) reduce the load on the web server without blocking search engines entirely.
The HN code keeps everything in memory (lazily loaded from disk). If every search engine and me-too crawler out there hits every page, you constantly have the entire HN corpus in memory. I assume that memory overload is what leads to the site being slow and unreliable. I don't know if the Arc code has any LRU scheme or if everything sticks around until the server falls over.
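If it doesn't, an LRU layer is not much code. A minimal sketch, assuming items are lazily loaded from disk (load_item and MAX_ITEMS are made up, and again this is Python, not Arc):

    from collections import OrderedDict

    MAX_ITEMS = 100_000
    cache = OrderedDict()  # item_id -> item, least recently used first

    def load_item(item_id):
        return {"id": item_id}  # stand-in for reading the item file from disk

    def get_item(item_id):
        if item_id in cache:
            cache.move_to_end(item_id)     # mark as recently used
            return cache[item_id]
        item = load_item(item_id)
        cache[item_id] = item
        if len(cache) > MAX_ITEMS:
            cache.popitem(last=False)      # evict the least recently used
        return item

With a cap like that, a crawler sweep churns the cache but can't grow the process past a fixed size.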
HN just needs a super simple caching proxy in front of it to reduce load on the app server. Alternatively, just generate static pages for all topics older than 5 days.
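Something like this nginx config would do it, assuming the app server sits on port 8080 (the paths and times are placeholders):

    proxy_cache_path /var/cache/hn levels=1:2 keys_zone=hn:10m inactive=60m;

    server {
        listen 80;
        location / {
            proxy_pass http://127.0.0.1:8080;
            proxy_cache hn;
            proxy_cache_valid 200 5m;   # serve cached copies for 5 minutes
        }
    }

The catch is that HN pages are personalized (your username, vote arrows), so in practice you'd cache only for logged-out users and crawlers, or go the static-page route for old threads.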
Thank you, and that makes total sense. Case in point:
I googled an executive of a company that was looking to do business with me. I came across some Google Groups discussions he was part of from a couple of years ago... and let's just say the material would be a dream for someone with a vendetta against him.
I guess that might have to do with the load these search engines generate. Previously in a thread we found that comments posted on HN appear in Google search within a minute of posting. That must create a good amount of load on HN. Multiply it by all the search engines in the wild, and pg probably decided blocking them wouldn't do any harm. My guess. Could be wrong, or it might be temporary or a mistake.
My bet is that it is for technical reasons. Whenever you see a reply link (among other things) the server has to persist a continuation for it. Getting all of those pages crawled must generate a massive number of continuations to persist.
I know that this is one of the reasons that vote-links look the way they do.
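For anyone who hasn't poked at the Arc server: those fnid links are essentially keys into a server-side table of closures. A rough Python analogue, with all names invented and add_comment as a stub:

    import secrets

    continuations = {}  # fnid -> closure; grows every time a page is rendered

    def add_comment(parent_id, text, user):
        pass  # stub for the real comment-posting code

    def reply_link(parent_id):
        def handle_reply(text, user):
            return add_comment(parent_id, text, user)
        fnid = secrets.token_hex(8)
        continuations[fnid] = handle_reply
        return "/x?fnid=" + fnid

Every page view mints fresh entries, one per reply link, and they all have to live somewhere until they expire. A full crawl of the site does that for every link on every page.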
If that's not the reason, then my #2 guess is that getting the whole site crawled moves too many comments into the cache. If I remember correctly, whenever an uncached comment gets accessed it gets cached (basically added to a hash), and thus held in RAM. That gradually raises the memory requirements for running the app until pg does a restart, which resets the cache as well.
It's unfortunate, because I often use site: to find old threads on ycombinator. Especially since news.ycombinator.com doesn't have its own search facility, external search engines are the only way to comb through old posts here.
Well, at least my profile on HN will no longer be the first result for a Google search for my username (it can now revert to some blog that's not mine). But I really wish HN were searchable; I try to find insight in the comment threads here all the time.
In my mind legitimate crawlers are large engineered programs that respect robots.txt, whereas a scraper might just be a Perl script some dude wrote to check a website for updates periodically. Technically I suppose they all should probably respect robots.txt, but in my experience something you think of as a scraper rarely does.
That's not a legitimate reason to ban Google. You can always throttle their indexing using Google's Webmaster Tools. Or you can use search-engine-specific lines in robots.txt (Bing, anyone?) to ban misbehaving bots.
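For example, a robots.txt that throttles Bing and bans one bad actor without touching Google ("BadBot" is a placeholder; Crawl-delay is honored by Bing and Yahoo but ignored by Google, which only offers throttling through Webmaster Tools):

    User-agent: bingbot
    Crawl-delay: 30

    User-agent: BadBot
    Disallow: /

    User-agent: *
    Disallow:

An empty Disallow line means "everything is allowed" for everyone else.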
The only reason to do this is if you don't want your site indexed by Google at all, and I can't think of a legitimate reason to want that.