
How to scrape anything on the web and not get caught - karolmajta
https://tinyendian.com/articles/how-to-scrape-the-web-and-not-get-caught/
======
CapacitorSet
That's a very low-quality article, in my opinion. It takes an entire article
to show how to use a simple tool and how to fetch a list of proxies, uses a
Makefile when a shell script would do just fine, and exaggerates the title.

~~~
ospider
I do web crawling for a living; the method mentioned in the article does not
work for most sites.

~~~
Bromskloss
Do most sites have scraping detection at all? Are they even opposed to
scraping?

~~~
codefined
On a website I'd written, we had pseudo-randomly generated URLs to show dynamic
content (it was a game, and the URL contained parameters). On each page we had
this little widget that suggested five random configurations people might like
to try.

A few times our website went down because the load average climbed above 30.
Eventually I discovered Google was doing something funky; adding the dynamic
URLs to the "robots.txt" file fixed the issue. Then some other search engines /
scrapers seemed to run into the same issue and started requesting hundreds of
thousands of URLs per day (these pages were dynamically generated and took a
moderate amount of compute power).

We eventually did have to implement basic anti-scraper rules because it was
degrading the user experience.
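
For anyone hitting the same thing: the robots.txt fix is basically a rule that
keeps well-behaved crawlers away from the expensive dynamic paths. A minimal
sketch (the /game/ prefix is just a placeholder for whatever your dynamic URLs
actually look like):

    # robots.txt -- keep well-behaved crawlers off the expensive dynamic pages
    User-agent: *
    Disallow: /game/

    # Non-standard; honoured by some crawlers (e.g. Bing), ignored by Google.
    Crawl-delay: 10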

------
buildbuildbuild
Careful. Using open proxies could be considered unauthorized access in some
jurisdictions. Some of these proxies were installed without the user's
permission.

This is my favorite consensual alternative:
[https://github.com/mattes/rotating-proxy](https://github.com/mattes/rotating-proxy)

~~~
darpa_escapee
This solution will not work with HTTPS. There are other alternatives and it's
easy to roll your own.

~~~
endisukaj
For example?

------
robsun
Some time ago I was looking for an apartment to buy. Sites in my country are
bloated and terribly slow; checking several offers took minutes. Moreover, I
live in a city where good offers are sold the same day they are published.

I decided to run a scraper to fetch all the data about available apartments in
my city. Thanks to that I was able to browse offers at the speed of Tinder. It
took me a few hours to write all the stuff, and it saved me probably weeks.

To avoid getting caught I decided to set up Tor on my Raspberry Pi and use it
as a proxy. It was extremely easy and reliable. The sites were so slow that I
didn't notice a significant performance drop, and I didn't have to care about
changing proxies because Tor did that for me.

Other than that, it's a good idea to change User-Agents and add some random
delays between calls. Luckily, in this case that was enough.
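
If anyone wants to replicate this, here's a rough sketch of the fetch loop
(assuming Tor's SOCKS proxy on its default 127.0.0.1:9050 and the
requests[socks] extra installed; the User-Agent list, URLs, and delay range are
just placeholders):

    import random
    import time

    import requests  # pip install requests[socks]

    # socks5h:// makes DNS resolution go through Tor as well.
    TOR_PROXY = "socks5h://127.0.0.1:9050"
    PROXIES = {"http": TOR_PROXY, "https": TOR_PROXY}

    # A few desktop User-Agents to rotate through (example values).
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36",
        "Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/11.1 Safari/605.1.15",
    ]

    def fetch(url):
        headers = {"User-Agent": random.choice(USER_AGENTS)}
        resp = requests.get(url, headers=headers, proxies=PROXIES, timeout=30)
        resp.raise_for_status()
        return resp.text

    for url in ["https://example.com/listing/1", "https://example.com/listing/2"]:
        html = fetch(url)
        # ... parse the listing here ...
        time.sleep(random.uniform(2, 10))  # random delay between calls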

------
qiqitori
I had to scrape past a per-IP access limit once, and just went with IPv6
addresses, of which IPv6 users have plenty. (This was almost five years ago,
so it's conceivable that some services have wised up a bit and now block the
entire IPv6 prefix.)
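
For anyone wondering how: the trick is just binding each outgoing connection to
a different source address out of a prefix routed to your machine. A rough
standard-library sketch; it assumes the addresses are actually configured on
your interface (e.g. from a routed /64), and 2001:db8::/64 is documentation
space standing in for your real prefix:

    import http.client
    import itertools

    # Placeholder addresses from a prefix routed to this host.
    SOURCE_ADDRS = ["2001:db8::%x" % i for i in range(1, 101)]
    addr_cycle = itertools.cycle(SOURCE_ADDRS)

    def fetch(host, path):
        # source_address binds the local end of the TCP connection; this only
        # works if the target actually resolves to an IPv6 address.
        conn = http.client.HTTPSConnection(host, source_address=(next(addr_cycle), 0))
        try:
            conn.request("GET", path)
            resp = conn.getresponse()
            return resp.status, resp.read()
        finally:
            conn.close()

    status, body = fetch("example.com", "/some/page")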

------
ikeboy
> after introducing proxies my crawl times grew by an order of magnitude from minutes to hours

Yeah, same experience. Right now I use luminati.io datacenter IPs, which work
OK. Anyone know of a cheaper option that works well? I'm scraping tens of
millions of pages a month.

~~~
tomarr
I suppose the economics of it comes into play in a similar vein to Mailchimp:
the lower the pricing, the scammier the clients, and the more IPs they lose to
blacklists.

~~~
ikeboy
Not really. Mail is bad by default; you need to build up trust just to get a
tiny amount of deliverability. Fetching webpages is good by default, until
you're detected as bad.

The thing is, a lot of scraping goes unnoticed. Maybe you get an extra
thousand hits here and there. But every spam campaign gets noticed and results
in some percentage of spam complaints from users.

------
pdkl95
I assume simple BFS and DFS traversal behavior shows up brightly in access
logs, making detection more likely. Does it help to use random-first
search[1]? Or is it better to attempt to emulate human actors (which requires
much more development effort)?

[1] [https://bl.ocks.org/mbostock/11161648](https://bl.ocks.org/mbostock/11161648)
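
For reference, the only difference from BFS/DFS is which frontier entry you pop
next; a small sketch of the idea (fetch_links is a placeholder for whatever
does the actual request and link extraction):

    import random

    def random_first_crawl(seed_url, fetch_links, max_pages=1000):
        """Traverse by popping a random frontier entry instead of FIFO (BFS) or LIFO (DFS)."""
        frontier = [seed_url]
        seen = {seed_url}
        while frontier and len(seen) < max_pages:
            # BFS would use frontier.pop(0), DFS frontier.pop(); random-first:
            url = frontier.pop(random.randrange(len(frontier)))
            for link in fetch_links(url):  # -> iterable of absolute URLs (placeholder)
                if link not in seen:
                    seen.add(link)
                    frontier.append(link)
        return seen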

------
dis-sys
This article has the lowest quality I've ever seen on HN. The title is hugely
exaggerated.

Using a list of proxies and hoping that is enough to scrape _anything_ on the web?

------
sram1337
I don't know if scrapy handles this, but I've run into issues with sites
fingerprinting my browser. Proxies help, but there are other ways to identify
site visitors aside from IP addresses.

~~~
JeanMarcS
Rotating user agents?

~~~
dredmorbius
Minimally effective, or outright detrimental, in my own limited testing.

------
your-nanny
What sorts of real-world and legitimate/ethical use cases are there for
wholesale repeated scraping?

~~~
lazycouchpotato
Statistics, I'd say, is one of the useful use cases for scraping.

Back in 2013, a guy scraped the results of about 150,000 students giving their
10th grade finals for a particular examination board in India. He showed not
only that there was no privacy for students' marks, because the roll numbers
were all linearly incremented, but also that mass-scale manipulation of marks
was going on.

The concept is simple but it's a very interesting read.

[https://deedy.quora.com/Hacking-into-the-Indian-Education-Sy...](https://deedy.quora.com/Hacking-into-the-Indian-Education-System?srid=d5K&share=1)

I was one of the 150,000 kids that gave those exams back in 2013 :)

~~~
reallymental
Fascinating read. I can't find anything more about what happened to him after
he was accused of 'hacking' the government's systems; do you have any other
sources that shed some light on that?

~~~
lazycouchpotato
I tweeted out to him and he responded.

[https://twitter.com/debarghya_das/status/988178000914022400](https://twitter.com/debarghya_das/status/988178000914022400)

------
Theodores
I have no idea what tools are available for denying access to web scrapers.
This I should know, given that I have built a few websites and know what to do
to get pages serving quickly; somehow I missed the memo on how to set your
site up so it can't be scraped. Is there an nginx setting for that?

This could be interesting for people that do scrape sites to know too: what
basic, reasonable measures can one take beyond looking at logs and doing IP
bans?

~~~
amelius
Captchas are one (drastic) option.

~~~
nojvek
Any image-based captcha where one needs to identify words in an image can be
easily broken by algorithms now.

Google would be the leader with reCAPTCHA, but as a human I fail a good number
of them. They are a very annoying experience for your users.

------
Bromskloss
> proxy-lists getProxies --sources-white-list="gatherproxy,sockslist"

Is there a reason for using only gatherproxy and sockslist? There are more
lists [0] available.

[0] [https://github.com/chill117/proxy-lists/blob/4bb8064703b09ee...](https://github.com/chill117/proxy-lists/blob/4bb8064703b09ee07488f04eaf992431fdbe7761/readme.md#supported-proxy-lists)

------
wiradikusuma
I use the paid service Proxy Bonanza ($12/mo for 2 IPs), and I build my own as
well using Squid ($5/mo on DigitalOcean).
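
For reference, a minimal squid.conf for that kind of personal forward proxy
might look roughly like this (the basic_ncsa_auth helper path is
distro-dependent, and /etc/squid/passwords is whatever htpasswd file you
created):

    # Minimal forward proxy with basic auth (sketch)
    http_port 3128

    # Require a username/password so the proxy isn't open to the world.
    auth_param basic program /usr/lib/squid/basic_ncsa_auth /etc/squid/passwords
    auth_param basic realm proxy
    acl authenticated proxy_auth REQUIRED

    http_access allow authenticated
    http_access deny all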

~~~
lemagedurage
You pay more per IP than per server!? Why don't you get 3 DO instances then?

~~~
chrismeller
I can't speak for that particular service, but in other similar ones I've
looked at and used in the past, you're paying for a certain number of IPs _at
a given time_.

So, for instance, they have a pool of servers that have 1000 IPs available.
Your account allows connections to go out over 2 of those at a time. If
something happens (like one gets banned by whatever service you're scraping),
you can get a different set of 2 IPs and keep moving.

While you're still paying a relatively high price for what you're consuming
(predominantly bandwidth in this case), you're paying for the flexibility.

------
UniZero
Contrary to popular belief, a lot of high-traffic sites can be scraped from a
single IP without hitting access limits.

------
mattcoles
In what way is the article not scummy as hell? You shouldn't waste Jenkins
server time with this...

------
amelius
Could this be a solution: run a website, and let your visitors do the
crawling?

~~~
tzahola
Won’t work. Adversaries can taint your data by sending back fake results.

~~~
amelius
Could multiple downloads and a "consensus" algorithm solve this problem?

~~~
tzahola
Kinda, but it’s far from trivial, as you would need some sort of tolerance when
comparing sites with dynamic content.
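
As a rough sketch of what the consensus step could look like: normalise each
copy, hash it, and only accept a result once some quorum of independent
visitors agrees (the normalisation and quorum here are placeholders; real sites
would also need volatile bits like timestamps, tokens, and ads stripped before
comparing):

    import hashlib
    import re
    from collections import Counter

    def normalise(html):
        # Crude normalisation: collapse whitespace before hashing.
        return re.sub(r"\s+", " ", html).strip()

    def accept_result(copies, quorum=3):
        """copies: HTML strings for the same URL from independent visitors.
        Returns the majority copy if at least `quorum` of them agree, else None."""
        digests = [hashlib.sha256(normalise(c).encode()).hexdigest() for c in copies]
        digest, votes = Counter(digests).most_common(1)[0]
        return copies[digests.index(digest)] if votes >= quorum else None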

