The Wrong Way to Get Noticed by YC

pg · on Aug 8, 2008

Don't worry, you didn't slow down the server. I throttled your IP address. The server can withstand a crawler or two. The reason I ask people not to crawl the site is simply that if I let one person do it, I have to let everyone, and considering the audience here, that might mean we'd have 100 of them.

kylec · on Aug 8, 2008

Why not create periodic dumps of the database and allow people to glean interesting statistical data from it? You could remove any non-public information (passwords, preferences, individual up/down voting, last 2 octets of the IP, etc). It would be interesting to see what hours and what days people are active, where the visitors are coming from, what the most common words used are, etc.
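
As a rough illustration of the scrubbing step described above, here is a minimal Python sketch. The field names (`password`, `preferences`, `votes`, `ip`) and both helper functions are hypothetical, chosen only for the example; nothing here reflects the actual HN schema, and only IPv4 addresses are handled.

```python
import ipaddress

# Hypothetical names for the non-public fields mentioned in the comment.
PRIVATE_FIELDS = {"password", "preferences", "votes"}

def mask_last_two_octets(ip: str) -> str:
    """Zero the last two octets of an IPv4 address, e.g. a.b.c.d -> a.b.0.0."""
    # Treat the address as a member of a /16 network and keep the network part.
    network = ipaddress.ip_network(f"{ip}/16", strict=False)
    return str(network.network_address)

def scrub_record(record: dict) -> dict:
    """Return a copy of a record that is safe to include in a public dump."""
    public = {k: v for k, v in record.items() if k not in PRIVATE_FIELDS}
    if "ip" in public:
        public["ip"] = mask_last_two_octets(public["ip"])
    return public

print(scrub_record({"user": "alice", "password": "secret", "ip": "198.51.100.7"}))
```

Masking to a /16 keeps coarse location information while dropping the part of the address that identifies a single host.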

boredguy8 · on Aug 9, 2008

Like

http://news.ycombinator.com/item?id=172701 http://news.ycombinator.com/item?id=182374 http://news.ycombinator.com/item?id=197644 http://news.ycombinator.com/item?id=212491 http://news.ycombinator.com/item?id=213891

? http://news.ycombinator.com/item?id=218782

andreyf · on Aug 9, 2008

The data you can gather from crawling the site aren't as specific as the data in the database. For example, you don't know the exact submission times of older posts, only the day.

chengmi · on Aug 9, 2008

That information can be derived from multiple observations.
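
One way to read "multiple observations", sketched in Python under an assumption the thread does not confirm: that an older item's age is displayed as a whole number of days, rounded down. Each observation then bounds the submission time to a one-day window, and intersecting the windows narrows it. The function name and the sample timestamps are made up for illustration.

```python
from datetime import datetime, timedelta

DAY = timedelta(days=1)

def bound_submission_time(observations):
    """Narrow down a submission time from (observed_at, age_in_days) pairs.

    Assumes an item shown as "n days ago" at time t was submitted
    between t - (n + 1) days and t - n days.
    """
    earliest = max(t - (n + 1) * DAY for t, n in observations)
    latest = min(t - n * DAY for t, n in observations)
    if earliest > latest:
        raise ValueError("observations are inconsistent")
    return earliest, latest

# Two sightings of the same item, eight hours apart, with the label
# changing from "2 days ago" to "3 days ago" in between.
window = bound_submission_time([
    (datetime(2008, 8, 9, 10, 0), 2),
    (datetime(2008, 8, 9, 18, 0), 3),
])
print(window)
```

The closer together two observations with different labels are, the tighter the resulting window.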

trevelyan · on Aug 9, 2008

Maybe because it's too much work.

maxklein · on Aug 9, 2008

How did you discover it that quickly then? Do you have some type of server monitoring tool?

pg · on Aug 9, 2008

There are a bunch of (mostly primitive) ways to watch what's going on. I didn't notice it all that quickly, though.

spiralhead · on Aug 8, 2008

The real question here, I think, is whether you really have any legal recourse to stop us (I'm not just being provocative; I actually wonder).

wmf · on Aug 8, 2008

There have been several lawsuits about Web crawlers: http://ilt.eff.org/index.php/Trespass_to_Chattels

dominik · on Aug 8, 2008

Yes, but it depends.

geuis · on Aug 9, 2008

@pg I've started checking out Gnip, http://www.gnipcentral.com. Their API makes it very easy for sites to publish their data to the gnip service. They are pretty hardcore about "hit us as much as you want" to get data. Twitter, Digg, Flickr, and a number of other services are publishing to them.

Might be a good way for HN to avoid getting hammered by crawlers and still let the hacker-types slurp the data.

mleonhard · on Aug 10, 2008

The title of the gnip homepage is not professional: 'Gnip: We got $h*t to pop' :(

rapind · on Aug 9, 2008

Sounds like a great idea.

tel · on Aug 9, 2008

Of course, the real question is how you managed to not check HN for almost all of Wednesday!

hooande · on Aug 9, 2008

Paul is really cool, man; don't worry about it. More than anything, he'll probably respect the fact that you tried to make something.

It's actually kind of hard to upset him. Frankly, he has a lot of money. It's going to take more than one errant script to ruin his day.

babul · on Aug 9, 2008

Does having a lot of money make people harder to upset?

davi · on Aug 9, 2008

I'm starting to envy people with even-keeled dispositions the way other people envy the rich.

khafra · on Aug 11, 2008

All other variables held equal, the more my money reserves and influx grow in proportion to my obligations and outflow, the more comfortable I am--with diminishing returns, of course. The part that gives money the "doesn't make you happy" reputation is that obligations usually grow as income does, and often other variables vary.

staticshock · on Aug 8, 2008

If you must, why not just scrape the google cache instead of the live site?

http://www.google.com/search?q=cache:news.ycombinator.com/it...

At this moment, that's the latest cached item they have. So, it'll lag by a few days, but that wouldn't matter much here.

matt1 · on Aug 9, 2008

On second thought Google cache might not be the best idea:

"We're sorry...

... but your query looks similar to automated requests from a computer virus or spyware application. To protect our users, we can't process your request right now.

We'll restore your access as quickly as possible, so try again soon. In the meantime, if you suspect that your computer or network has been infected, you might want to run a virus checker or spyware remover to make sure that your systems are free of viruses and other spurious software.

If you're continually receiving this error, you may be able to resolve the problem by deleting your Google cookie and revisiting Google. For browser-specific instructions, please consult your browser's online support center.

If your entire network is affected, more information is available in the Google Web Search Help Center.

We apologize for the inconvenience, and hope we'll see you again on Google. To continue searching, please type the characters you see below:"

bullseye · on Aug 9, 2008

Include a referrer HTTP header, set the user agent, and set a random wait time between 500ms and 5s.
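
A minimal Python sketch of the request pattern described above, using only the standard library. The URL, referrer, and user-agent strings are placeholders, and the helper names are invented for the example; the request is only built, not sent.

```python
import random
import urllib.request

def build_request(url: str, referer: str, user_agent: str) -> urllib.request.Request:
    """Build a request carrying explicit Referer and User-Agent headers."""
    # Note the HTTP header is spelled "Referer", with a single "r".
    return urllib.request.Request(
        url,
        headers={"Referer": referer, "User-Agent": user_agent},
    )

def random_delay(min_seconds: float = 0.5, max_seconds: float = 5.0) -> float:
    """Pick a wait time uniformly between 500 ms and 5 s."""
    return random.uniform(min_seconds, max_seconds)

request = build_request(
    "http://example.com/page",       # placeholder URL
    referer="http://example.com/",   # placeholder referrer
    user_agent="example-crawler/0.1",  # placeholder user agent
)
delay = random_delay()
print(request.header_items(), delay)
```

A caller would pass `delay` to `time.sleep` between successive fetches.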

I crawl Google frequently and that format has always worked for me... At least it did until I made this post. :)

Hexstream · on Aug 9, 2008

If you can bypass the rules they don't count, eh?

fauigerzigerk · on Aug 9, 2008

To be honest I can't take it seriously when a company that makes billions from crawling other people's sites makes a rule that you're not allowed to crawl theirs.

Google has a lot of conflicts with media companies, and it always starts out with Google ignoring some "rules", expecting to either win in court or settle at some point.

So I think breaking these rules is part of the process of how sensible rules are established in the first place. Yes it's recursive, but HN readers should be smart enough to understand that ;-)

neilc · on Aug 10, 2008

"To be honest I can't take it seriously when a company that makes billions from crawling other people's sites makes a rule that you're not allowed to crawl theirs."

Why not? Google obeys robots.txt -- if you don't want them to crawl your site, it is trivial to arrange. I think violating Google's terms of use is pretty hard to justify as ethical.
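
For crawler authors, honoring robots.txt is about as easy as writing one. A short sketch with Python's standard `urllib.robotparser`; the rules, the bot name, and the URLs are made up for the example, and the rules are parsed from a list rather than fetched from a live site.

```python
from urllib.robotparser import RobotFileParser

# Example rules a site might publish; not taken from any real robots.txt.
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(rules)

# Ask before fetching: is this user agent allowed to request this URL?
print(parser.can_fetch("example-bot", "http://example.com/private/data"))
print(parser.can_fetch("example-bot", "http://example.com/public"))
```

Against a live site, `set_url` followed by `read` would load the rules from the site's own robots.txt instead.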

fauigerzigerk · on Aug 10, 2008

Google obeys robots.txt but not much else. Just ask publishers and news outlets. And as I said, I don't see rules as a black and white thing. If they were, there would never be any progress. No iTunes without Napster. No Microsoft being the biggest software company in the world without piracy. That doesn't mean it's ethically justified to break any rule. I just think it's not as simple as someone stating their own rules and everyone automatically obeying them.

staticshock · on Aug 9, 2008

interesting. at what point do you get that message?

matt1 · on Aug 9, 2008

very quickly.

when i checked it in my browser it was fine, but as soon as i started scanning with the software google blocked it. they may have recognized that the requests weren't coming from a standard browser, so they flagged my ip.

markbao · on Aug 8, 2008

I started to submit a lot (a TON) when I was really addicted to HN a few months ago. I received a similar message from PG, but about submitting a lot of articles. My reaction was also similar, something along the lines of shitshitshitshit there goes my chance.

pedalpete · on Aug 8, 2008

if you want to do this kind of stuff, why not go with Yahoo BOSS or something similar? You could theoretically do some of the sorting you were hoping to do.

Readmore · on Aug 8, 2008

Haha I had the same problem the first time I set my Ruby crawler loose. Within 5 minutes I had crashed a "large electronics retailer's site". I'm omitting the name to protect the guilty.

tstegart · on Aug 9, 2008

Wouldn't the guilty be you? :)

tlrobinson · on Aug 10, 2008

Presumably a site that can be taken down by a single simple crawler is guilty of crappy scaling.

maxklein · on Aug 8, 2008

The answer is called Sleep(3000);
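
In Python terms, the same idea might look like the sketch below: a fixed three-second pause between requests. The `fetch` callable and the `crawl_politely` name are hypothetical; `sleep` is a parameter only so the pause can be swapped out when trying the loop without waiting.

```python
import time

def crawl_politely(urls, fetch, delay_seconds=3.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # No need to wait before the very first request.
            sleep(delay_seconds)
        results.append(fetch(url))
    return results

# Dry run with a stand-in fetch and a sleep that only records the pauses.
pauses = []
pages = crawl_politely(
    ["/item?id=1", "/item?id=2", "/item?id=3"],
    fetch=lambda url: f"contents of {url}",
    sleep=pauses.append,
)
print(pages, pauses)
```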

d0mine · on Aug 8, 2008

Wait 9 days and the index is ready http://www.google.com/search?q=270%2C000*3+seconds+in+days
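
The arithmetic behind that link, spelled out in Python. The figures are read off the query string (270,000 requests at 3 seconds each); the variable names are just for the example.

```python
REQUESTS = 270_000            # number of requests, from the query above
DELAY_SECONDS = 3             # pause per request
SECONDS_PER_DAY = 24 * 60 * 60

total_days = REQUESTS * DELAY_SECONDS / SECONDS_PER_DAY
print(total_days)  # 9.375
```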

qw · on Aug 9, 2008

That's not bad. I'd wait for that.

And it will only take a few hours to update later.

gaika · on Aug 8, 2008

Have to confess, I've gotten a "please stop" email too (for my comments). The first thing that came to mind was "there goes my chance to get funded by YC".

icey · on Aug 9, 2008

Well, Robert Morris is part of YC. I'm sure if anyone understands accidentally overloading a system because of programming curiosity, it's RTM.

mattmaroon · on Aug 9, 2008

Couldn't you just create a new HN account?

allenbrunson · on Aug 9, 2008

that sounds ill-advised, at best. i'd lay odds that pg pays enough attention to be able to connect the new username with the old one, if he put any effort into it.

Tichy · on Aug 8, 2008

I think YC has a better sense of humor than that.

lyime · on Aug 9, 2008

After reading the post and thread, I have a sudden urge to write a crawler. Obviously not on YC ;)

jrockway · on Aug 8, 2008

How did pg find your e-mail? IP address from web log -> user account -> profile?

brianlash · on Aug 8, 2008

"A few weeks ago I emailed Paul Graham asking whether I could create a searchable database of Hacker News."

PG had the email address from that first message. How he made the link between the server load and Matt's index is the real question.

matt1 · on Aug 8, 2008

I post under the same IP that I crawled on. Wouldn't have been too hard to connect the two.

a-priori · on Aug 8, 2008

Maybe he had his indexer log into the site?

mrtron · on Aug 9, 2008

I feel left out that I haven't felt the need to crawl YC...I am merely hiding in anonymity!

sh1mmer · on Aug 9, 2008

I think it's pretty big of you to fess up and apologise. Honesty and sincerity count.

Well done. :)

henning · on Aug 8, 2008

"You could do all sorts of interesting analysis on it… top posts, top contributors, posting frequency, etc etc."

How is that interesting?

alaskamiller · on Aug 8, 2008

You would be surprised: http://top.searchyc.com/