Don't worry, you didn't slow down the server. I throttled your IP address. The server can withstand a crawler or two. The reason I ask people not to crawl the site is simply that if I let one person do it, I have to let everyone, and considering the audience here, that might mean we'd have 100 of them.
Why not create periodic dumps of the database and allow people to glean interesting statistical data from it? You could remove any non-public information (passwords, preferences, individual up/down voting, last 2 octets of the IP, etc). It would be interesting to see what hours and what days people are active, where the visitors are coming from, what the most common words used are, etc.
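A rough sketch of the scrubbing I have in mind, assuming the dump comes out as one JSON object per line (the field names are just guesses at what the schema might look like):

    import json

    # Fields that should never leave the server; names are hypothetical.
    PRIVATE_FIELDS = {"password_hash", "email", "prefs", "votes"}

    def anonymize(row):
        """Drop private fields and mask the last two octets of the IP."""
        public = {k: v for k, v in row.items() if k not in PRIVATE_FIELDS}
        if "ip" in public:
            a, b, _, _ = public["ip"].split(".")
            public["ip"] = "%s.%s.x.x" % (a, b)
        return public

    with open("dump.json") as src, open("dump_public.json", "w") as out:
        for line in src:
            out.write(json.dumps(anonymize(json.loads(line))) + "\n")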
The data you can gather from crawling the site aren't as specific as the data in the database. For example - you don't know the exact submission times of older pieces, only the day.
@pg I've started checking out Gnip, http://www.gnipcentral.com. Their API makes it very easy for sites to publish their data to the gnip service. They are pretty hardcore about "hit us as much as you want" to get data. Twitter, Digg, Flickr, and a number of other services are publishing to them.
Might be a good way for HN to avoid getting hammered by crawlers and still let the hacker-types slurp the data.
All other variables held equal, the more my money reserves and influx grow in proportion to my obligations and outflow, the more comfortable I am--with diminishing returns, of course. The part that gives money the "doesn't make you happy" reputation is that obligations usually grow as income does, and often other variables vary.
On second thought, Google cache might not be the best idea:

"We're sorry...
... but your query looks similar to automated requests from a computer virus or spyware application. To protect our users, we can't process your request right now.
We'll restore your access as quickly as possible, so try again soon. In the meantime, if you suspect that your computer or network has been infected, you might want to run a virus checker or spyware remover to make sure that your systems are free of viruses and other spurious software.
If you're continually receiving this error, you may be able to resolve the problem by deleting your Google cookie and revisiting Google. For browser-specific instructions, please consult your browser's online support center.
If your entire network is affected, more information is available in the Google Web Search Help Center.
We apologize for the inconvenience, and hope we'll see you again on Google. To continue searching, please type the characters you see below:"
To be honest I can't take it seriously when a company that makes billions from crawling other people's sites makes a rule that you're not allowed to crawl theirs.
Google has a lot of conflicts with media companies, and it always starts out with Google ignoring some "rules", expecting to either win in court or settle at some point.
So I think breaking these rules is part of the process of how sensible rules are established in the first place. Yes, it's recursive, but HN readers should be smart enough to understand that ;-)
To be honest I can't take it seriously when a company that makes billions from crawling other people's sites makes a rule that you're not allowed to crawl theirs.
Why not? Google obeys robots.txt -- if you don't want them to crawl your site, it is trivial to arrange. I think violating Google's terms of use is pretty hard to justify as ethical.
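For reference, keeping Googlebot off an entire site is two lines in robots.txt (swap in User-agent: * to keep everyone out):

    User-agent: Googlebot
    Disallow: /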
Google obeys robots.txt but not much else. Just ask publishers and news outlets. And as I said, I don't see rules as a black and white thing. If they were, there would never be any progress. No iTunes without Napster. No Microsoft being the biggest software company in the world without piracy. That doesn't mean it's ethically justified to break any rule. I just think it's not as simple as someone stating their own rules and everyone automatically obeying them.
When I checked it in my browser it was fine, but as soon as I started scanning with the software, Google blocked it. They may have recognized that the requests weren't coming from a standard browser, so they flagged my IP.
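For what it's worth, my guess at the signal: most HTTP libraries announce themselves in the User-Agent header unless you override it. A quick Python illustration of the difference (not what my scanner actually sent, just the general idea):

    import urllib.request

    # What urllib sends by default: the opener stamps every request with
    # 'User-agent: Python-urllib/x.y' -- trivial for a server to flag.
    opener = urllib.request.build_opener()
    print(opener.addheaders)

    # A browser-like User-Agent (the value here is just an example) removes
    # that particular signal -- though spoofing it is usually exactly what
    # a site's terms of use forbid.
    req = urllib.request.Request(
        "http://www.example.com/",
        headers={"User-Agent": "Mozilla/5.0 (compatible; ordinary browser)"},
    )
    print(req.get_header("User-agent"))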
I started submitting a lot (a TON) when I was really addicted to HN a few months ago. I received a similar message from PG, but about submitting too many articles. My reaction was also similar, something along the lines of "shitshitshitshit, there goes my chance."
If you want to do this kind of stuff, why not go with Yahoo BOSS or something similar? You could theoretically do some of the sorting you were hoping to do.
Haha I had the same problem the first time I set my Ruby crawler loose. Within 5 minutes I had crashed a "large electronics retailer's site". I'm omitting the name to protect the guilty.
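For anyone about to make the same mistake: the simplest guard is to serialize your requests and sleep between them. A minimal sketch, in Python rather than Ruby, with an arbitrary one-second delay and a hypothetical URL list:

    import time
    import urllib.error
    import urllib.request

    # Hypothetical URL list -- substitute whatever you are actually crawling.
    urls = ["http://www.example.com/"] * 5

    DELAY = 1.0  # seconds between requests; arbitrary, but any pause beats none

    for url in urls:
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                page = resp.read()
            # ... parse/store `page` here ...
        except urllib.error.URLError as err:
            print("skipping %s: %s" % (url, err))
        time.sleep(DELAY)  # one request at a time, with a pause, instead of a flood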
That sounds ill-advised, at best. I'd lay odds that pg pays enough attention to be able to connect the new username with the old one, if he put any effort into it.