
Avoiding Webscraping Throttling Using Python and Tor as a Proxy - bored_hacker
https://boredhacking.com/tor-webscraping-proxy/
======
CWuestefeld
While this is intellectually interesting, I'm troubled by the fact that the
author seems not to have given the slightest thought to the fact that he's
breaking the site's T&C, or to how much this abuse costs the service.

I'm particularly sensitive to this because I'm constantly dealing with scraper
bots from competitors that are trying to monitor our pricing. Without our
ongoing policing, and a fair amount of developer time going into it, the
traffic coming from these bots - and hence the amount it costs us to operate
the site - is significantly larger than that of actual customers. Let me say
that again: scraper bots account for more traffic on our sites than do
legitimate customers.

~~~
meritt
> breaking the site's T&C

T&C's have absolutely no bearing on publicly accessible information [1][2][3].
They only apply to registered users accessing login-required portions of your
website or application. Browsewrap does not represent a legally binding
contract, no matter how much you pay your lawyers to write up your irrelevant
T&C's.

[1] [https://www.eff.org/deeplinks/2018/04/scraping-just-automate...](https://www.eff.org/deeplinks/2018/04/scraping-just-automated-access-and-everyone-does-it)

[2] [https://www.eff.org/deeplinks/2018/04/dc-court-accessing-pub...](https://www.eff.org/deeplinks/2018/04/dc-court-accessing-public-information-not-computer-crime)

[3] [https://www.eff.org/deeplinks/2018/01/ninth-circuit-doubles-...](https://www.eff.org/deeplinks/2018/01/ninth-circuit-doubles-down-violating-websites-terms-service-not-crime)

~~~
jacquesm
The world is a lot larger than just the United States. Besides that, even
publicly available information can be copyrighted. So the scraping itself may
be legal; storing that information, using it and re-publishing it very likely
is not.

~~~
luckylion
> The world is a lot larger than just the United States.

Where would this be significantly different? For T&C to be valid, you'll have
to agree to them _before_ using the service. Since you generally can't know,
much less agree to, the T&C of a website before using that website, you have a
chicken & egg problem where the law tends to come down on "these T&C aren't
binding" imho.

You're right about the re-publishing though. Storing is likely a different
issue, even browser-caching is storing; iirc there were court decisions in
Germany that essentially argued that even having data in memory is creating a
copy.

~~~
aw3c2
In Germany there was a recent ruling that scraping, even for competing
purposes, is fine.

[https://www.heise.de/newsticker/meldung/Urteil-zu-Screen-Scr...](https://www.heise.de/newsticker/meldung/Urteil-zu-Screen-Scraping-BGH-legt-schriftliche-Urteilsbegruendung-vor-2236236.html)

------
mirimir
It's overkill to use Tor for this. And I'm a little surprised that it works at
all well, because Tor exits so commonly trigger CAPTCHAs.

Better, I think, would be using HTTPS proxies. But not free ones, which tend
to get burned down pretty quickly. There are sites that lease private proxies,
and guarantee that they work.
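
As a rough sketch of what routing requests through a leased proxy might look
like in Python, the snippet below uses only the standard library. The proxy
URL and the `build_proxy_opener` helper are hypothetical placeholders, not any
particular provider's API.

```python
import urllib.request

# Hypothetical placeholder; a real provider would supply host, port and credentials.
PROXY_URL = "http://user:password@proxy.example.com:8080"


def build_proxy_opener(proxy_url):
    """Return an opener that sends HTTP and HTTPS requests via proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


opener = build_proxy_opener(PROXY_URL)
# opener.open("https://example.com/", timeout=10) would then fetch through the proxy.
```

Nothing is sent over the network until `opener.open(...)` is called, so the
proxy address can be swapped per request if the provider hands out a pool.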

~~~
nurettin
There are such paid proxy services that provide you with a refreshable IP
pool. Unfortunately, you can't control which IP they give you at each request
(or when you can, your request rate drops significantly) so they do not work
with websites which require you to keep a session alive.

~~~
bshipp
I use crawlera, and it does have the capability to establish and maintain a
session. I can't pick the precise initial IP but once the connection is made I
can certainly tell it to keep using that IP.

Sessions are the only way to use crawlera with libraries like cloudflare-
scrape, which pin your authentication to a specific IP.

~~~
nurettin
That is interesting. $1000 tier is a considerable amount, but perhaps one can
find a profitable line of products to justify the price. Of course, this is
too expensive for initial development most of the time, so I'd go for an in-
house network of machines initially.

------
256cats
Well, Tor is usually blocked. If you want something more reliable, use private
proxy services, for example [https://gimmeproxy.com](https://gimmeproxy.com)

------
Thorrez

        <span class="pull-right" id="ipv4">2a0b:f4c1::7</span>
    

How is that an IPv4?
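
For what it's worth, a quick check with Python's standard `ipaddress` module
(just a sketch, using the value copied from the span above) suggests it parses
as an IPv6 address despite the `id="ipv4"`:

```python
import ipaddress

# Value copied verbatim from the quoted <span>.
addr = ipaddress.ip_address("2a0b:f4c1::7")

print(addr.version)   # 6
print(addr.exploded)  # the same address with the "::" shorthand written out
```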

~~~
unnouinceput
xxx.xxx.xxx.xxx and the mask. in his example: 2a = 42; 0b = 11; f4 = 244; c1 =
193; mask = 7 (weird i know) so the IP is actually 42.11.244.193. A short trip
to any free IP locator site says that IP is from South Korea

------
foobar_
If only websites offered their data dumps for free, instead of making people
reverse engineer the HTML.

------
nomilk
Could scraping through Tor be risky if the site has different content depending
on the visitor's IP? Or does Tor allow you to control for that (e.g. get only
IPs from a specific location)?

~~~
jjjbokma
You can use ExitNodes in your torrc to set a country code, e.g. {us}.
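
A minimal sketch of generating those torrc lines from Python, in case the
country list changes per scraping job. The `exit_node_lines` helper is a
hypothetical name, and adding `StrictNodes 1` is my assumption (it asks Tor to
treat the list as a requirement rather than a preference); it is not something
the article does.

```python
def exit_node_lines(country_codes, strict=True):
    """Build torrc lines restricting exits to the given two-letter country codes."""
    # torrc expects each country code wrapped in braces, e.g. {us}.
    nodes = ",".join("{%s}" % code.lower() for code in country_codes)
    lines = ["ExitNodes " + nodes]
    if strict:
        # Assumption: we want Tor to refuse exits outside the list.
        lines.append("StrictNodes 1")
    return "\n".join(lines)


print(exit_node_lines(["us"]))
```

The printed lines can be appended to torrc before starting Tor.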

------
paulryanrogers
Not sure how long this will last since there are a limited number of exit
nodes. Many networks throw up CAPTCHA for such nodes.

~~~
wybiral
Ah, yes, the captcha arms race. I've encountered bots that can solve simple
text captchas (especially bots routed through Tor) and as ML gets cheaper and
more accurate I do wonder what the end result will be...

~~~
greglindahl
Well, for Tor in particular, the most likely end result is Tor being banned
from an increasing number of websites.

~~~
mirimir
That already happened.

~~~
Topgamer7
Yeah honestly surprised scraping through tor would work. I imagine most
commercial solutions routinely block exit nodes.

~~~
viraptor
Depending on how you read the embargo laws for your country, you may be
required to block them. If you identify the source of the traffic as a non-
country, dropping it may be what you want to do.

------
bydl0coder
It's highly likely that a web site will greet any connections from Tor with a
captcha.

------
lammalamma25
Cool article but couldn't you just spoof your IP using something like socket?

~~~
lammalamma25
Replying to myself here. This will not work because the packet still has to go
back to the spoofed address. What a poorly thought out comment.

~~~
emj
One could try changing the IPv6 address; it might work.

------
nurettin
This method will not work for many websites that require you to stay logged
in.

