Ask PG: Are there IP blocks?
47 points by halostatue on Nov 4, 2011 | 27 comments
I feel a little silly submitting this via Ask, but I don't see any other way to ask this.

Yesterday, I was no longer able to reach HN via my work wifi. I can ping HN and get responses, but the web server explicitly sends nothing back:

    % curl http://news.ycombinator.com
    curl: (52) Empty reply from server

I can still see HN through other networks (I'm doing this currently through my wifi hotspot), and my account is apparently still active.

Has a block been placed on my work network? I have captured the work network's root IP address if it's needed.

Yes, we block IPs that seem to be crawlers ignoring robots.txt. We've always blocked abusive IPs, but I tightened up the blocking a few weeks ago. A lot of people were crawling HN, most of them unnecessarily because they were doing things they could have done more efficiently through HNSearch's API.

Some users may remember that the site had gotten really slow a few weeks ago. One of the reasons it's faster now is that we cracked down on crawlers.
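
For anyone writing a client that fetches pages automatically, here is a minimal sketch of checking robots.txt rules before requesting a URL, using Python's standard `urllib.robotparser`. The robots.txt content, the `example.com` URLs, and the `my-bot` user agent are made up for illustration; they are not HN's actual rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the real file on any given site will differ.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 30
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True only if the rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)
allowed = may_fetch(parser, "my-bot", "http://example.com/news")
blocked = may_fetch(parser, "my-bot", "http://example.com/private/page")
delay_seconds = parser.crawl_delay("my-bot")  # seconds to wait between requests
```

A well-behaved client would skip any URL for which `may_fetch` returns False and wait at least `delay_seconds` between requests.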

That makes sense. I suspect that the "cause" in this case was the chrome extension that I had been using (I don't remember the name offhand, but it's the one that shows changes in HN between visits). I usually left the tab open but only visited a couple of times a day, so it may have been doing something like that.

I think it's time I shut down http://ihackernews.com. It's getting impossible to keep the IP address from being blocked. Really unfortunate, because a lot of people like to use the site.

If you tell me the ip addr I can whitelist it. pg@ycombinator.com

Wow, thanks! Email sent.

I had to stop using it because of that. And I started to think of re-writing it as a self-hosted solution to hopefully avoid IP bans.

I don't know if you'd consider releasing it for anyone to install on their own server, and if pg would be fine with that solution. (assuming that each page load would dynamically scrape a single page on ycombinator.com)

edit: of course, it might still have its issues if a lot of people were to host it with the same host…

Another solution would be for pg to add a mobile stylesheet. That's honestly the only reason why I would need it.

Anyone know if this is what the WP7 HN App uses? I've noticed that the app no longer works, which is a shame because I used it all the time.

Appears so: https://github.com/jpf/hacker-news-wp7/blob/master/README

If HN could return JSON instead of HTML, I'm guessing a lot of the crawling could be moved to the client side, kind of like appending .json to any URL on reddit.

Beware of a Chrome extension called "Hacker News Sidebar"; it presumably got me IP banned this week. It cross-checks every page you visit against HN to see if it has a thread and, if so, displays the thread.

Here is the extension: https://chrome.google.com/webstore/detail/hhedbplnihmkekhgma...

Hmmm. I think I have that extension installed, too. Out that one comes, too.

I got blocked two weeks ago when I was playing with creating a "realtime" view of comments so you didn't have to refresh the page. To test I had it polling one story every five seconds and I think I left it running overnight. (Sorry about that.)

Next day, no HN, so I spent the next week browsing HN on Firefox with a proxy set up through an EC2 instance. Thankfully, either my IP changed or the ban has been lifted.

For what I was doing the HNSearch API wouldn't have helped, but if there was an API like the one at ihackernews.com that's running and live, that'd be great.
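
A fixed five-second poll left running overnight is the kind of traffic that can trip a ban. One gentler pattern, sketched here under assumptions, is to back off while the page is unchanged and return to the base interval only after a change. The function names, the 60-second base, and the one-hour cap are all hypothetical choices, not anything HN documents.

```python
import time
from typing import Callable, List, Optional


def next_delay(current: float, changed: bool,
               base: float = 60.0, cap: float = 3600.0) -> float:
    """Reset to the base delay after a change; otherwise double, up to the cap."""
    if changed:
        return base
    return min(current * 2, cap)


def poll(fetch: Callable[[], str], max_polls: int,
         sleep: Callable[[float], None] = time.sleep,
         base: float = 60.0, cap: float = 3600.0) -> List[float]:
    """Call fetch() up to max_polls times and return the delays used between calls."""
    delays: List[float] = []
    delay = base
    last: Optional[str] = None
    for i in range(max_polls):
        body = fetch()
        if last is not None:
            # Compare with the previous response to decide how long to wait next.
            delay = next_delay(delay, body != last, base, cap)
        last = body
        if i < max_polls - 1:
            sleep(delay)
            delays.append(delay)
    return delays
```

Passing `fetch` and `sleep` in as arguments keeps the sketch independent of any particular HTTP library and makes it easy to try with canned responses.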

I was automatically banned a while ago, after doing something silly (checking all my bookmarks for dead links). The explanation was that the server thought I was DDoSing. It was OK after a week or so. Maybe it'll work out for you too.

I was doing something even more silly and was banned for 48 hours or so.

Question: Is there any software that can be easily installed on Apache or inside an app to detect crawlers?

We use ipban, but that is not what we want: we want a system that can easily detect a "bad" crawler or an "abusing" user and ban them for some time.

As of now, we have a simple script that goes through the Apache logs and sends us a list of IPs and their activity.
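
A log-scanning script of the kind described above might look roughly like the following sketch. It assumes Apache's common/combined log format and flags any IP that exceeds a per-minute request threshold. The threshold of 30, the function names, and the sample log lines (which use documentation-reserved IP addresses) are all hypothetical.

```python
import re
from collections import Counter
from datetime import datetime

# Matches the start of a common/combined log line: IP, identity, user, [timestamp], "request"
LOG_LINE = re.compile(r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<request>[^"]*)"')


def requests_per_minute(lines):
    """Count requests per (ip, minute) bucket; lines that do not parse are skipped."""
    counts = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if not m:
            continue
        ts = datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
        counts[(m.group("ip"), ts.strftime("%Y-%m-%d %H:%M"))] += 1
    return counts


def abusive_ips(lines, max_per_minute=30):
    """Return the sorted IPs that exceeded max_per_minute in at least one minute."""
    counts = requests_per_minute(lines)
    return sorted({ip for (ip, _minute), n in counts.items() if n > max_per_minute})


# Made-up sample: one client making 40 requests in a minute, another making one.
SAMPLE = [
    f'192.0.2.10 - - [04/Nov/2011:10:00:{s:02d} +0000] "GET /item?id={s} HTTP/1.1" 200 512'
    for s in range(40)
] + ['203.0.113.5 - - [04/Nov/2011:10:00:05 +0000] "GET / HTTP/1.1" 200 1024']

flagged = abusive_ips(SAMPLE)
```

The resulting list could be fed to whatever banning mechanism is already in place; expiring bans after some time would need extra bookkeeping that this sketch leaves out.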

We use fail2ban (available as an Ubuntu package) very successfully for this. You can point it at the Apache log and fine-tune the rules by browser string, URL, or whatever you want.

At Amazon, it's not uncommon for both ycombinator and twitter to be unreachable because of our IPs being blocked.

I also get empty responses when I try to browse HN over Tor. I assume this is because my IP address looks like a spammer's. I, too, would like to know if there are IP blocks, and what (if anything) legitimate users can do to get around them.

There were some people doing some bad stuff over Tor a while ago, so we banned all the Tor exit nodes they were using. They seem to have given up so I'll try unbanning these IPs.

Our work got blocked as well this week, it seems. I can still browse HN via https though. I don't know if that's intentional or not. :)

Have you tried loading: https://news.ycombinator.com/?

If you tell me the ip I'll whitelist it. pg@ycombinator.com

That sounds more like your work added some new rules to their filter that ban some of the content here. Using https would encrypt that content and keep the filter from scanning it. Now you just have to hope that they don't ban it by URL.

No, I can assure you that's not the case, unless Comcast decided to do it. It appears to be fine now, but this week we were getting 0-byte responses on port 80 while everything worked fine over HTTPS.


Is there a proxy at work? I was experimenting with an HTTP proxy as part of my thesis work a few weeks ago and found similar results. I didn't end up ever solving my problem though...

Isn't it likely that the admins at work blocked HN via a proxy? A few years ago my comments to HN were being mangled by a web proxy at work.

I don't think there's a web proxy, but I could be wrong. Our own sites are the kind that might be blocked by your average proxy (no, it's not gambling or porn).

I am also checking that hypothesis, but news.yc is the only site returning an empty result from my normal reading list.

Until this morning, I had a hn-related Chrome extension installed that could have been misbehaving.

Our primary ISP (BSNL NIB) at work is entirely blocked as well. Bharti Airtel is able to get access.

I have a similar problem visiting HN from my mobile network. I get a 502 response then.

