
Don't tell StackOverflow I'm a hacker (they think I'm a bot) - bambax
http://blog.medusis.com/dont-tell-stackoverflow-im-a-hacker-they-thin
======
spolsky
Hi!

First of all - team@stackoverflow email goes straight to Jeff, and he replies
to it personally. There's no secret team of people pretending to be Jeff to
deal with customer service requests :-)

I personally believe that any company--or country--that thinks it's a good
idea to filter web traffic deserves just about as much collateral damage as
they get. If a company or country wants to cut its workers off from the
knowledge they need to do their job, honestly, I would like to see that
company or country get beaten to a pulp by the forces of evolution.

But... I feel for you. There are lots of websites that use our API to provide
alternate views on Stack Overflow at other addresses, for example,
<http://sa.column80.com/?api=0> has a nice lynx-compatible browser.

~~~
conductor
I get "Fatal error: The user ID must be numeric in
/home/markness/column80-sofu/phpstack.class.php on line 419" when trying to
read anything

~~~
m0tive
I think you need to enable cookies. That fixed the error for me.

------
d_r
I'm surprised that people are scraping their content via bots, given that it's
licensed under CC and freely available to download.
(<http://blog.stackoverflow.com/category/cc-wiki-dump/>)

~~~
Tyrannosaurs
I'm guessing that these are people who either want something more up to date
than the data dump (which is monthly if I recall) or who have scraping
infrastructure set up and working and it's easier to use that than to work
with the schema.

~~~
ivoflipse
It's bi-monthly nowadays. Even so, you can use the API instead; it's there for
a reason.

It's too bad for this guy that he simply can't access SO the normal way, but
that's not SO's fault.

~~~
Tyrannosaurs
I agree. SO is great but we all need to be realistic about situations where
for one reason or another we're edge cases.

------
mootothemax
Whilst it's not ideal, currently experiencing spammer abuse has led me down
similar paths on one of my web apps. Sadly, solutions like this free up time
for more core functionality - which I'd much rather spend my time developing,
considering that abuse prevention is not exactly a feature you can charge for.

~~~
bambax
Yes, I understand that, and scrapers certainly are a big pain; but shouldn't
SO extend a little more effort to avoid false positives...?

~~~
bxr
It seems possible for them to set up the block as a response to request
rate/patterns, but as I am unfamiliar with EC2 I have to ask: how hard is it
to get a new IP address? If it is a matter of stopping an instance and starting
another (or easier), blocking all of EC2 may be their only move.

Once the scraper isn't tied to a single IP, I don't see how to filter out
abusive requests from the legitimate ones. It will cause the scraper to
emulate real-user behavior more and more to defeat the barriers. At the end of
the cat and mouse game SO's only option is to block the IP range, or allow
scraping from EC2.

------
almost
It probably saves them a whole load of pain. And you are rather a special
case: how many non-bots try to access Stack Overflow via EC2, do you think?

There are plenty of other cheap VPS providers anyway...

~~~
bambax
> There are plenty of other cheap VPS providers anyway...

Certainly, but what am I supposed to do, hop from provider to provider
according to who (blanket-)blocks what...?

> you are rather a special case

I set up this VPN, following instructions from a blog post from 2009; I'm
guessing there are many people who did the same?

But if I'm going to die as collateral damage, I don't intend to succumb
quietly! ;-)

~~~
subway
While I don't agree with an employer or job site filtering web access,
following a two-year-old blog post on circumventing a client's security
measures borders on irresponsible and stupid. You may think of it as only
bypassing some web filtering, but by establishing a VPN connection across
their border, you've potentially opened their network up to any insecurities
in your machine or your EC2 instance with a direct link past their border that
your client is completely unaware of.

If you need to perform work on a client site that isn't permitted over their
network, you should bring your own connectivity. I carry a cheap cdma modem
for that very purpose. Just don't expose your client to additional risks so
that you can read SO.

------
barrkel
If you just need proxy support to get around restrictions, EC2 is an expensive
way of doing it (assuming you're using it 8 hours a day).

You can get a VPS (<http://www.lowendbox.com/> tracks various offers) for as
little as 3 USD/month, and have it 24/7/365. Even if you're using an EC2 micro
instance, you'd have to use it less than 150 hours a month (out of 730) to get
ahead.

~~~
pama
Isn't the EC2 cost 1.68 USD/month for your assumed usage pattern? (0.007
USD/hour * 8 hour/day * 30 day/month)

~~~
barrkel
I took 0.02 USD/hour for on-demand micro instance; 0.007 USD/hour also
requires a $52/yr or $82/3yr fixed cost on top.
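
For anyone who wants to check the numbers, here is the arithmetic as a rough Python sketch. The prices are the ones quoted in this thread (3 USD/month VPS, 0.02 USD/hour on-demand, 0.007 USD/hour plus $52/yr reserved); the constant and function names are made up for illustration, and it assumes a 30-day month.

```python
# Prices as quoted in this thread; they are a snapshot, not current AWS pricing.
VPS_PER_MONTH = 3.00            # USD, cheap VPS
ON_DEMAND_PER_HOUR = 0.02       # USD, on-demand micro instance
RESERVED_PER_HOUR = 0.007       # USD, reserved micro instance
RESERVED_FIXED_PER_YEAR = 52.0  # USD, 1-year reservation fee


def break_even_hours(vps_per_month, hourly_rate):
    """Hours per month below which the hourly instance is cheaper than the VPS."""
    return vps_per_month / hourly_rate


def reserved_monthly_cost(hours_per_day, days_per_month=30):
    """Usage charge plus the yearly reservation fee spread over 12 months."""
    usage = RESERVED_PER_HOUR * hours_per_day * days_per_month
    return usage + RESERVED_FIXED_PER_YEAR / 12


if __name__ == "__main__":
    print(break_even_hours(VPS_PER_MONTH, ON_DEMAND_PER_HOUR))
    print(round(reserved_monthly_cost(8), 2))
```

The first figure is the 150-hour break-even against on-demand pricing. The second shows that 8 hours a day on a reserved instance comes to about 6 USD/month once the fixed fee is included, rather than 1.68.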

I have no idea how to predict the spot price, and I haven't tried to collect
statistics on it, so I (personally) wouldn't choose it for this kind of
application, where you expect it to be there all the time, not just when it
happens to be cheap.

I mean, 3 USD/month, or 30 USD/year, is cheap enough to pay once and forget
about, rather than worrying about turning it on or off.

~~~
yummyfajitas
_...so I (personally) wouldn't choose it for this kind of application, where
you expect it to be there all the time, not just when it happens to be cheap._

It's not very difficult to set up a script to spin up an on-demand instance if
your spot instance gets nuked. Or to just buy another spot instance at a
higher price.

<https://github.com/boto/>

------
ck2
How are the entire EC2 IP blocks known?

I doubt they rDNS every connection that comes in, too expensive.

Ah, here's a list but it's from 2007, might be out of date

<https://forums.aws.amazon.com/message.jspa?messageID=106925#106925>

~~~
tedunangst
You only have to look up once for each /24 and presumably you'd cache the
response for a while. The lookup doesn't need to be done online either.
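
A minimal sketch of that idea in Python, assuming IPv4 only. `PrefixCache` and the `classify` callable are hypothetical names; `classify` stands in for whatever slow check you run (a reverse DNS lookup via `socket.gethostbyaddr`, say), and the one-hour TTL is an arbitrary choice.

```python
import ipaddress
import time


class PrefixCache:
    """Caches one verdict per /24 so the slow check runs once per prefix."""

    def __init__(self, classify, ttl_seconds=3600.0, clock=time.monotonic):
        self._classify = classify   # hypothetical slow check: ip string -> verdict
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}          # /24 network -> (verdict, time stored)

    def lookup(self, ip):
        # strict=False masks the host bits: 203.0.113.7 -> 203.0.113.0/24
        prefix = ipaddress.ip_network(f"{ip}/24", strict=False)
        now = self._clock()
        hit = self._entries.get(prefix)
        if hit is not None and now - hit[1] < self._ttl:
            return hit[0]
        verdict = self._classify(ip)
        self._entries[prefix] = (verdict, now)
        return verdict
```

Nothing here has to sit in the request path: the cache could just as well be filled by a batch job over yesterday's access logs.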

------
MichaelGG
StackOverflow once blocked the entire country I'm in (would just return a
403). I emailed them and was told they can re-enable it:

"It's not a problem, I just need reassurance that there won't be RssNotifiers
hitting us 1,000 times a day and pulling down uncompressed data."

To be fair, I'm sure they have more important things to do than work on
complex rate limiting and abuse detection code, especially for edge cases like
small countries or EC2.

------
eof
You probably have ssh access to more machines than just that EC2 instance?
Tunnel your traffic through an ssh account.

ssh -D 10001 you@someplace.com

So long as someplace.com is not blocked by SO you can bind your browser's
traffic to your now-running local socks proxy in the internet connectivity
section. Set all traffic to go through 127.0.0.1:10001

------
lukev
I've hit this issue as well, since we use a lot of VMs on EC2.

------
tygorius
_“We’ll just block all of EC2” seems not only excessively broad but, well,
lazy._

This conclusion bugged me a bit, in part because I'm old enough to remember
when programmers considered laziness to be a virtue. But also because it's not
evidence-based. If it were a problem faced by a significant portion of SO
readers, I suspect they'd find another way to address it. But if it's a
problem only experienced by a couple of readers a year, then spending much
time on it would be an "industrious" misuse of resources.

------
exratione
EC2 is a big issue for a lot of things. For example, while I was maintaining a
credit card donation form for a non-profit, a large chunk of fraudulent
submissions came from EC2 addresses - actually more than came from African
ISPs.

So I too blocked the whole of EC2; the logic is pretty straightforward, in
that a legitimate customer originating requests from there is highly unlikely
compared to the other options.
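
The check itself is only a few lines. A sketch in Python, where the CIDR ranges are placeholder documentation ranges (not real EC2 blocks) and `is_blocked` is a name made up for the example:

```python
import ipaddress

# Placeholder ranges reserved for documentation; substitute a real, current list.
BLOCKED_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_blocked(ip, ranges=BLOCKED_RANGES):
    """Return True if the address falls inside any blocked range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in ranges)
```

A linear scan like this is fine for a handful of ranges; with thousands of them a sorted structure or a firewall rule would be the better place for it.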

------
emeltzer
Same exact situation here--need a VPN to access some sites from China. No
worries though--just make an exclusion rule for SO!

------
zzo38
Adding the "api" parameter is also useful if you want to copy the URL to
someone else. So you can write such things as:
<http://sa.column80.com/?q=1860&api=77>
Now it is a URL that can be copied to someone else that is not on the same
session.

------
zzo38
I found out how to do without cookies: Add the "api" parameter to the query
string when retrieving a message. You might have to do it manually every time,
unless you can write a program to do it for you.

------
ry0ohki
EC2 is always marked as spam if you try to send email from an instance too. It
would be nice if Amazon could create some tier of whitelisted, verified
non-spam instances or something.

~~~
samuel1604
they have spam as a service for that <http://aws.amazon.com/ses/>

------
natabbotts
Couldn't agree more. Blanket banning of something is rarely good.

~~~
jon_r
You'd think they could rate limit requests coming from EC2 IP addresses rather
than blanket banning, but I guess value-wise it's cheaper to block all of them
given the low likelihood of it being a real person.
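
Roughly what I have in mind, as a Python sketch: a token bucket per client IP, applied only to addresses that some `is_suspect` predicate flags. All the names here are hypothetical, the rates are arbitrary, and it has the weakness raised elsewhere in this thread: a scraper that keeps getting new IPs gets a fresh bucket each time.

```python
import time


class TokenBucket:
    """Allows `burst` requests at once, refilled at `rate_per_sec`."""

    def __init__(self, rate_per_sec, burst, clock=time.monotonic):
        self.rate = rate_per_sec
        self.burst = float(burst)
        self.clock = clock
        self.tokens = float(burst)
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill for the time elapsed since the last call, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class SuspectRangeLimiter:
    """Throttles flagged addresses; everyone else passes untouched."""

    def __init__(self, is_suspect, rate_per_sec=1.0, burst=5, clock=time.monotonic):
        self.is_suspect = is_suspect  # hypothetical predicate: ip string -> bool
        self.rate = rate_per_sec
        self.burst = burst
        self.clock = clock
        self.buckets = {}             # ip string -> TokenBucket

    def allow(self, ip):
        if not self.is_suspect(ip):
            return True
        bucket = self.buckets.get(ip)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.burst, self.clock)
            self.buckets[ip] = bucket
        return bucket.allow()
```

A real deployment would also need to evict idle buckets, which this sketch leaves out.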

------
huntero
I often tunnel my traffic through an EC2 instance when I'm on public WiFi and
I've run into this problem with a number of sites.

Most notably (and annoying when out in the city), Yelp.

~~~
bambax
> Most notably (and annoying when out in the city), Yelp.

Confirmed! (but I don't use Yelp so I wouldn't notice)

------
zzo38
Hay! I try to use it on gopher?

------
Hisoka
I'm with StackOverflow on this one. If I were them, I wouldn't waste my time
fine-tuning the detection, and would just block all of Amazon EC2.

------
trustfundbaby
It's a business decision.

Easy to do ... instead of having someone spend time trying to make it work for
0.005% of Stack Overflow use cases (yes. I pulled that number entirely out of
my rectum. chill out.) ... sucks, but I get it.

------
jrockway
_“We’ll just block all of EC2” seems not only excessively broad but, well,
lazy._

Well, yeah. It's a Windows application, and lazy is the name of the game on
Windows.

~~~
bplesser
Why not just use TOR with your EC2?

