The Wrong Way to Get Noticed by YC (mattmazur.com)
90 points by matt1 on Aug 8, 2008 | 50 comments

Don't worry, you didn't slow down the server. I throttled your IP address. The server can withstand a crawler or two. The reason I ask people not to crawl the site is simply that if I let one person do it, I have to let everyone, and considering the audience here, that might mean we'd have 100 of them.
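For anyone wondering what per-IP throttling can look like, here is a minimal sketch of a sliding-window limiter. It is not how the HN server does it (the thread doesn't say); the class name `IpThrottle`, the limits, and the injectable `clock` are all assumptions made for illustration.

```python
import time
from collections import defaultdict, deque


class IpThrottle:
    """Allow at most max_requests per IP within a sliding time window.

    Hypothetical helper: the name and parameters are illustrative only.
    """

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without waiting
        self.hits = defaultdict(deque)  # ip -> timestamps of recent allowed requests

    def allow(self, ip):
        now = self.clock()
        recent = self.hits[ip]
        # Forget requests that have aged out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.max_requests:
            return False  # over the limit: throttle this request
        recent.append(now)
        return True
```

Each IP gets its own queue, so one noisy crawler is slowed down without affecting other visitors.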

Why not create periodic dumps of the database and allow people to glean interesting statistical data from it? You could remove any non-public information (passwords, preferences, individual up/down voting, last 2 octets of the IP, etc). It would be interesting to see what hours and what days people are active, where the visitors are coming from, what the most common words used are, etc.
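A rough sketch of the scrubbing step suggested above, assuming each row of the dump is a plain dict. The column names in `PRIVATE_FIELDS` and the `ip` key are made up for the example; the real schema isn't known from this thread.

```python
import ipaddress

# Hypothetical column names; substitute whatever the real schema uses.
PRIVATE_FIELDS = {"password_hash", "preferences", "votes"}


def mask_ip(ip):
    """Zero the last two octets of an IPv4 address, keeping only the /16 prefix."""
    network = ipaddress.ip_network(f"{ip}/16", strict=False)
    return str(network.network_address)


def scrub(record):
    """Return a copy of the record with private fields dropped and the IP masked."""
    public = {k: v for k, v in record.items() if k not in PRIVATE_FIELDS}
    if "ip" in public:
        public["ip"] = mask_ip(public["ip"])
    return public
```

With this, `mask_ip("203.0.113.7")` gives `"203.0.0.0"`, which still supports coarse "where are visitors coming from" statistics.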

The data you can gather from crawling the site aren't as specific as the data in the database. For example, you don't know the exact submission times of older pieces, only the day.

That information can be derived from multiple observations.
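To make "multiple observations" concrete, here is one way it could work, assuming the site shows each age as a whole number of units rounded down (for example "1 day ago"). That rounding rule is an assumption, and `bounds` and `narrow` are illustrative names. Each sighting pins the submission time to an interval, and intersecting the intervals tightens the estimate.

```python
from datetime import datetime, timedelta


def bounds(observed_at, n, unit):
    """An age displayed as n whole units means n*unit <= age < (n+1)*unit.

    Returns the (earliest, latest) submission time consistent with that.
    """
    return observed_at - (n + 1) * unit, observed_at - n * unit


def narrow(observations):
    """Intersect the intervals implied by several (observed_at, n, unit) sightings."""
    lows, highs = zip(*(bounds(*obs) for obs in observations))
    return max(lows), min(highs)


day = timedelta(days=1)
sightings = [
    (datetime(2008, 8, 8, 12, 0), 1, day),  # saw "1 day ago" at noon
    (datetime(2008, 8, 8, 20, 0), 1, day),  # still "1 day ago" at 20:00
    (datetime(2008, 8, 8, 21, 0), 2, day),  # flipped to "2 days ago" by 21:00
]
earliest, latest = narrow(sightings)
```

Here three day-granularity sightings narrow the submission time to the hour between 20:00 and 21:00 on Aug 6, because that is when the displayed age rolled over.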

Maybe because it's too much work.

How did you discover it that quickly then? Do you have some type of server monitoring tool?

There are a bunch of (mostly primitive) ways to watch what's going on. I didn't notice it all that quickly, though.

The real question here, I think, is whether you really have any legal recourse to stop us (I'm not just being provocative; I actually wonder).

There have been several lawsuits about Web crawlers: http://ilt.eff.org/index.php/Trespass_to_Chattels

Yes, but it depends.

@pg I've started checking out Gnip, http://www.gnipcentral.com. Their API makes it very easy for sites to publish their data to the gnip service. They are pretty hardcore about "hit us as much as you want" to get data. Twitter, Digg, Flickr, and a number of other services are publishing to them.

Might be a good way for HN to avoid getting hammered by crawlers and still let the hacker-types slurp the data.

The title of the gnip homepage is not professional: 'Gnip: We got $h*t to pop' :(

Sounds like a great idea.

Of course, the real question is how you managed to not check HN for almost all of Wednesday!

Paul is really cool, man; don't worry about it. More than anything, he'll probably respect the fact that you tried to make something.

It's actually kind of hard to upset him. Frankly, he has a lot of money. It's going to take more than one errant script to ruin his day.

Does having a lot of money make people harder to upset?

I'm starting to envy people with even-keeled dispositions the way other people envy the rich.

All other variables held equal, the more my money reserves and influx grow in proportion to my obligations and outflow, the more comfortable I am--with diminishing returns, of course. The part that gives money the "doesn't make you happy" reputation is that obligations usually grow as income does, and often other variables vary.

If you must, why not just scrape the google cache instead of the live site?


At this moment, that's the latest cached item they have. So, it'll lag by a few days, but that wouldn't matter much here.

On second thought Google cache might not be the best idea:

"We're sorry...

... but your query looks similar to automated requests from a computer virus or spyware application. To protect our users, we can't process your request right now.

We'll restore your access as quickly as possible, so try again soon. In the meantime, if you suspect that your computer or network has been infected, you might want to run a virus checker or spyware remover to make sure that your systems are free of viruses and other spurious software.

If you're continually receiving this error, you may be able to resolve the problem by deleting your Google cookie and revisiting Google. For browser-specific instructions, please consult your browser's online support center.

If your entire network is affected, more information is available in the Google Web Search Help Center.

We apologize for the inconvenience, and hope we'll see you again on Google. To continue searching, please type the characters you see below:"

Include a referrer HTTP header, set the user agent, and set a random wait time between 500ms and 5s.
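A small sketch of that recipe using only the standard library. The user agent string and referring URL are placeholders, and `build_request` is an illustrative name. Note that the HTTP header is actually spelled `Referer`. The request is only constructed here, not sent.

```python
import random
import urllib.request

# Placeholder identity; a real crawler would put its own contact details here.
USER_AGENT = "example-crawler/0.1 (contact: you@example.com)"


def build_request(url, referer="http://example.com/"):
    """Build a request with Referer and User-Agent set, plus a random wait time."""
    headers = {"Referer": referer, "User-Agent": USER_AGENT}
    request = urllib.request.Request(url, headers=headers)
    delay = random.uniform(0.5, 5.0)  # seconds to wait before sending: 500ms to 5s
    return request, delay
```

The caller would `time.sleep(delay)` and then pass `request` to `urllib.request.urlopen`.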

I crawl Google frequently and that format has always worked for me... At least it did until I made this post. :)

If you can bypass the rules they don't count, eh?

To be honest I can't take it seriously when a company that makes billions from crawling other people's sites makes a rule that you're not allowed to crawl theirs.

Google has a lot of conflicts with media companies and it always starts out with google ignoring some "rules" expecting to either win in court or settle at some point.

So I think breaking these rules is part of the process of how sensible rules are established in the first place. Yes it's recursive, but HN readers should be smart enough to understand that ;-)

"To be honest I can't take it seriously when a company that makes billions from crawling other people's sites makes a rule that you're not allowed to crawl theirs."

Why not? Google obeys robots.txt -- if you don't want them to crawl your site, it is trivial to arrange. I think violating Google's terms of use is pretty hard to justify as ethical.

Google obeys robots.txt but not much else. Just ask publishers and news outlets. And as I said, I don't see rules as a black and white thing. If they were, there would never be any progress. No iTunes without napster. No Microsoft being the biggest software company in the world without piracy. That doesn't mean it's ethically justified to break any rule. I just think it's not as simple as someone stating their own rules and everyone automatically obeying it.
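For reference, checking robots.txt before fetching is easy to do from a crawler as well. A minimal sketch with Python's standard `urllib.robotparser`; the sample rules and the `mybot` user agent are invented for the example, and `allowed` is an illustrative helper.

```python
from urllib.robotparser import RobotFileParser

# Invented rules for illustration; a real crawler would download the site's own file.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Against the sample rules, `allowed(SAMPLE_ROBOTS, "mybot", "http://example.com/private/x")` is `False`, while the site's front page is permitted.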

Interesting. At what point do you get that message?

Very quickly.

When I checked it in my browser it was fine, but as soon as I started scanning with the software, Google blocked it. They may have recognized that the requests weren't coming from a standard browser, so they flagged my IP.

I started to submit a lot (a TON) when I was really addicted to HN a few months ago. I received a similar message from PG, but about submitting too many articles. My reaction was similar too, something along the lines of shitshitshitshit there goes my chance.

If you want to do this kind of stuff, why not go with Yahoo BOSS or something similar? You could theoretically do some of the sorting you were hoping to do.

Haha I had the same problem the first time I set my Ruby crawler loose. Within 5 minutes I had crashed a "large electronics retailer's site". I'm omitting the name to protect the guilty.

Wouldn't the guilty be you? :)

Presumably a site that can be taken down by a single simple crawler is guilty of crappy scaling.

The answer is called Sleep(3000);

That's not bad. I'd wait for that.

And it will only take a few hours to update later.
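The `Sleep(3000)` idea, sketched as a crawl loop that pauses three seconds between requests. `crawl` is an illustrative name, and `fetch` is whatever function downloads a page; both it and `sleep` are passed in so the loop can be tried without touching the network.

```python
import time


def crawl(urls, fetch, delay_seconds=3.0, sleep=time.sleep):
    """Fetch each URL in order, pausing delay_seconds between requests."""
    pages = []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay_seconds)  # no pause needed before the very first request
        pages.append(fetch(url))
    return pages
```

At one request every three seconds, 1,000 pages take about 50 minutes, which fits the "few hours" estimate above for a site of modest size.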

Have to confess, I got a "please stop" email too (for my comments). The first thing that came to mind was "there goes my chance to get funded by YC".

Well, Robert Morris is part of YC. I'm sure if anyone understands accidentally overloading a system because of programming curiosity, it's RTM.

Couldn't you just create a new HN account?

that sounds ill-advised, at best. i'd lay odds that pg pays enough attention to be able to connect the new username with the old one, if he put any effort into it.

I think YC has a better sense of humor than that.

After reading the post and thread, I have a sudden urge to write a crawler. Obviously not on YC ;)

How did pg find your e-mail? IP address from web log -> user account -> profile?

"A few weeks ago I emailed Paul Graham asking whether I could create a searchable database of Hacker News."

PG had the email address from that first message. How he made the link between the server load and Matt's index is the real question.

I post under the same IP that I crawled on. Wouldn't have been too hard to connect the two.

Maybe he had his indexer log into the site?

I feel left out that I haven't felt the need to crawl YC...I am merely hiding in anonymity!

I think it's pretty big of you to fess up and apologise. Honesty and sincerity count.

Well done. :)

"You could do all sorts of interesting analysis on it… top posts, top contributors, posting frequency, etc etc."

How is that interesting?

You would be surprised: http://top.searchyc.com/
