
The Wrong Way to Get Noticed by YC - matt1
http://www.mattmazur.com/2008/08/the-wrong-way-to-get-noticed-by-yc/
======
pg
Don't worry, you didn't slow down the server. I throttled your IP address. The
server can withstand a crawler or two. The reason I ask people not to crawl
the site is simply that if I let one person do it, I have to let everyone, and
considering the audience here, that might mean we'd have 100 of them.

~~~
kylec
Why not create periodic dumps of the database and allow people to glean
interesting statistical data from it? You could remove any non-public
information (passwords, preferences, individual up/down voting, last 2 octets
of the IP, etc). It would be interesting to see what hours and what days
people are active, where the visitors are coming from, what the most common
words used are, etc.
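
A scrubbing pass like the one described above can be sketched in a few lines. The field names below are invented for illustration, since HN's actual schema isn't public:

```python
def anonymize(row: dict) -> dict:
    """Drop non-public fields and coarsen the IP, keeping the rest."""
    # Hypothetical field names; the real schema is not public.
    private = {"password_hash", "prefs", "vote_log", "email"}
    public = {k: v for k, v in row.items() if k not in private}
    if "ip" in public:
        # Zero the last two octets so individual hosts can't be identified.
        octets = public["ip"].split(".")
        public["ip"] = ".".join(octets[:2] + ["0", "0"])
    return public

row = {"id": 1, "ip": "10.20.30.40", "password_hash": "x",
       "vote_log": [], "title": "Show HN: ..."}
print(anonymize(row))  # {'id': 1, 'ip': '10.20.0.0', 'title': 'Show HN: ...'}
```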

~~~
boredguy8
Like

<http://news.ycombinator.com/item?id=172701>
<http://news.ycombinator.com/item?id=182374>
<http://news.ycombinator.com/item?id=197644>
<http://news.ycombinator.com/item?id=212491>
<http://news.ycombinator.com/item?id=213891>

? <http://news.ycombinator.com/item?id=218782>

~~~
andreyf
The data you can gather from crawling the site aren't as specific as the data
in the database. For example, you don't know the exact submission times of
older pieces, only the day.

~~~
chengmi
That information can be derived from multiple observations.
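
One way to see why: each crawl that shows a post as "N days ago" bounds its submission time to a one-day window ending N days before the crawl, and intersecting the windows from crawls taken at different times narrows the estimate. A minimal sketch, assuming day-granularity timestamps in seconds:

```python
DAY = 86400  # seconds

def window(crawl_time: int, days_ago: int) -> tuple:
    # "Posted N days ago" at time t puts the submission somewhere in
    # the one-day interval (t - (N+1)*DAY, t - N*DAY].
    return crawl_time - (days_ago + 1) * DAY, crawl_time - days_ago * DAY

def narrow(observations) -> tuple:
    # Intersect the windows from every (crawl_time, days_ago) observation.
    lo = max(window(t, n)[0] for t, n in observations)
    hi = min(window(t, n)[1] for t, n in observations)
    return lo, hi

# Two crawls half a day apart shrink the uncertainty from 24 h to 12 h.
print(narrow([(10 * DAY, 2), (10 * DAY + DAY // 2, 2)]))  # (648000, 691200)
```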

------
geuis
@pg I've started checking out Gnip, <http://www.gnipcentral.com>. Their API
makes it very easy for sites to publish their data to the gnip service. They
are pretty hardcore about "hit us as much as you want" to get data. Twitter,
Digg, Flickr, and a number of other services are publishing to them.

Might be a good way for HN to avoid getting hammered by crawlers and still let
the hacker-types slurp the data.

~~~
mleonhard
The title of the gnip homepage is not professional: 'Gnip: We got $h*t to pop'
:(

------
tel
Of course, the real question is how you managed to not check HN for almost all
of Wednesday!

------
hooande
Paul is really cool, man, don't worry about it. More than anything, he'll
probably respect the fact that you tried to make something.

It's actually kind of hard to upset him. Frankly, he has a lot of money. It's
going to take more than one errant script to ruin his day.

~~~
babul
Does having a lot of money make people harder to upset?

~~~
davi
I'm starting to envy people with even-keeled dispositions the way other people
envy the rich.

------
staticshock
If you must, why not just scrape the google cache instead of the live site?

[http://www.google.com/search?q=cache:news.ycombinator.com/it...](http://www.google.com/search?q=cache:news.ycombinator.com/item%3Fid%3D261985)

At this moment, that's the latest cached item they have. So, it'll lag by a
few days, but that wouldn't matter much here.

~~~
matt1
On second thought, Google cache might not be the best idea:

"We're sorry...

... but your query looks similar to automated requests from a computer virus
or spyware application. To protect our users, we can't process your request
right now.

We'll restore your access as quickly as possible, so try again soon. In the
meantime, if you suspect that your computer or network has been infected, you
might want to run a virus checker or spyware remover to make sure that your
systems are free of viruses and other spurious software.

If you're continually receiving this error, you may be able to resolve the
problem by deleting your Google cookie and revisiting Google. For browser-
specific instructions, please consult your browser's online support center.

If your entire network is affected, more information is available in the
Google Web Search Help Center.

We apologize for the inconvenience, and hope we'll see you again on Google. To
continue searching, please type the characters you see below:"

~~~
bullseye
Include a Referer HTTP header, set the user agent, and set a random wait time
between 500ms and 5s.

I crawl Google frequently and that format has always worked for me... At least
it did until I made this post. :)
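
That recipe can be sketched with the standard library; the header values and user-agent string below are illustrative placeholders, not anything Google documents:

```python
import random
import time
import urllib.request

def build_request(url: str, referer: str = "https://www.google.com/"):
    # Illustrative header values; swap in whatever identifies your client.
    return urllib.request.Request(url, headers={
        "User-Agent": "Mozilla/5.0 (compatible; hn-archiver/0.1)",
        "Referer": referer,
    })

def polite_get(url: str) -> bytes:
    # Random delay between 500 ms and 5 s keeps the request rate low
    # and the timing irregular.
    time.sleep(random.uniform(0.5, 5.0))
    with urllib.request.urlopen(build_request(url)) as resp:
        return resp.read()
```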

~~~
Hexstream
If you can bypass the rules they don't count, eh?

~~~
fauigerzigerk
To be honest I can't take it seriously when a company that makes billions from
crawling other people's sites makes a rule that you're not allowed to crawl
theirs.

Google has a lot of conflicts with media companies and it always starts out
with google ignoring some "rules" expecting to either win in court or settle
at some point.

So I think breaking these rules is part of the process of how sensible rules
are established in the first place. Yes it's recursive, but HN readers should
be smart enough to understand that ;-)

~~~
neilc
_To be honest I can't take it seriously when a company that makes billions
from crawling other people's sites makes a rule that you're not allowed to
crawl theirs._

Why not? Google obeys robots.txt -- if you don't want them to crawl your site,
it is trivial to arrange. I think violating Google's terms of use is pretty
hard to justify as ethical.

~~~
fauigerzigerk
Google obeys robots.txt but not much else. Just ask publishers and news
outlets. And as I said, I don't see rules as a black and white thing. If they
were, there would never be any progress. No iTunes without napster. No
Microsoft being the biggest software company in the world without piracy. That
doesn't mean it's ethically justified to break any rule. I just think it's not
as simple as someone stating their own rules and everyone automatically
obeying them.

------
markbao
I started submitting a lot (a TON) when I was really addicted to HN a few
months ago. I received a similar message from PG, but about submitting too
many articles. My reaction was also similar, something along the lines of
shitshitshitshit there goes my chance.

------
pedalpete
If you want to do this kind of stuff, why not go with Yahoo BOSS or something
similar? You could theoretically do some of the sorting you were hoping to do.

------
Readmore
Haha I had the same problem the first time I set my Ruby crawler loose. Within
5 minutes I had crashed a "large electronics retailer's site". I'm omitting
the name to protect the guilty.

~~~
tstegart
Wouldn't the guilty be you? :)

~~~
tlrobinson
Presumably a site that can be taken down by a single simple crawler is guilty
of crappy scaling.

------
maxklein
The answer is called Sleep(3000);
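
That one-liner is essentially the whole strategy. A minimal throttled loop might look like this, with `fetch` standing in for whatever HTTP call the crawler makes (a hypothetical callable, not a real API):

```python
import time

DELAY = 3.0  # seconds between requests, i.e. Sleep(3000) milliseconds

def crawl(fetch, item_ids, delay=DELAY):
    # One request, then a fixed pause, so the server sees a steady,
    # low rate instead of a burst.
    pages = {}
    for item_id in item_ids:
        pages[item_id] = fetch(item_id)
        time.sleep(delay)
    return pages
```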

~~~
d0mine
Wait 9 days and the index is ready
<http://www.google.com/search?q=270%2C000*3+seconds+in+days>

~~~
qw
That's not bad. I'd wait for that.

And it will only take a few hours to update later.

------
gaika
Have to confess, I've got a "please stop" email too (for my comments). First
thing that came to mind was "there goes my chance to get funded by YC".

~~~
mattmaroon
Couldn't you just create a new HN account?

~~~
allenbrunson
that sounds ill-advised, at best. i'd lay odds that pg pays enough attention
to be able to connect the new username with the old one, if he put any effort
into it.

------
lyime
After reading the post and thread, I have a sudden urge to write a crawler.
Obviously not on YC ;)

------
jrockway
How did pg find your e-mail? IP address from web log -> user account ->
profile?

~~~
brianlash
"A few weeks ago I emailed Paul Graham asking whether I could create a
searchable database of Hacker News."

PG had the email address from that first message. How he made the link between
the server load and Matt's index is the real question.

------
mrtron
I feel left out that I haven't felt the need to crawl YC...I am merely hiding
in anonymity!

------
sh1mmer
I think it's pretty big of you to fess up and apologise. Honesty and sincerity
count.

Well done. :)

------
henning
"You could do all sorts of interesting analysis on it… top posts, top
contributors, posting frequency, etc etc."

How is that interesting?

~~~
alaskamiller
You would be surprised: <http://top.searchyc.com/>

