
Bypassing website anti-scraping protections - jardah
https://kb.apify.com/tips-and-tricks/several-tips-how-to-bypass-website-anti-scraping-protections
======
userbinator
_For example, for google.com, you can typically make only around 300 requests
per day, and if you reach this limit, you will see a CAPTCHA instead of search
results._

300 is pretty easy to achieve if you're "Googling hard enough" (make 5
slightly different queries, go through the 20 pages of results it's willing to
show you, repeat 3 times...), and I've seen it trigger far before that if you
are searching for more obscure things. It seems almost hostile to those
searching for IC part numbers, specific and very exact phrases, and just "non
mainstream" content in general.

How sad it is then, that we are told and have internalised the notion that we
should use search engines like Google to find things, and yet it prevents us
from "trying too hard" to find what we're looking for...

~~~
greglindahl
From my experience at blekko, 99.9% of the "people" who go deep into the
results pages for a single query are actually bots. You're a very unusual
user, and there are a lot of bots.

~~~
xenomachina
There's a difference between going deep into the results, and progressively
refining a query. The former is pretty indicative of bot behavior -- humans
rarely go past even the first page of results. I do the latter all the time,
and this frequently gets me Google's captcha, especially if I'm doing
something like using site: and inurl: operators.

------
cm2187
There is an irony in google preventing web scraping given that their business
is pretty much built on web scraping.

~~~
vinceguidry
Why is there irony in that? Anyone can go build a crawler and scrape the web
the way Google scrapes it so they can compete with Google. Google protecting
its site from scraping means you can't compete with Google _using Google's
own resources_.

That said, automated research fascinates me, I wouldn't want to scrape Google
to make my own Google, but rather to make private repositories of information
that I can then query efficiently. I would love to find any kind of scriptable
search engine access, paid or free. Not entirely sure how to look though.

~~~
PaulHoule
Think different. Try bing, it has an API.
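
A minimal sketch of calling it (my assumptions: the Bing Web Search v7 REST
endpoint, an Azure subscription key, and the node-fetch package; check the
current docs before relying on the exact URL or header names):

    // Sketch only: query the Bing Web Search API and return the web results.
    const fetch = require('node-fetch');

    async function bingSearch(query, apiKey) {
        const url = 'https://api.cognitive.microsoft.com/bing/v7.0/search?q=' +
            encodeURIComponent(query);
        const response = await fetch(url, {
            headers: { 'Ocp-Apim-Subscription-Key': apiKey },
        });
        const data = await response.json();
        return data.webPages ? data.webPages.value : [];
    }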

I think bing is close to Google in quality. Some people might even like it
better. On the other hand I think DDG is the Sprint of search engines.

Google used to have a search API and they discontinued it because they said
most of the people who used it were SEO people.

People who do pay-per-click are into A/B testing and other quantitative
testing. Google is all for you doing that if you pay for advertising. Their
mainstay of anti-SEO is doing arbitrary and random things to make it
impossible for SEOs to go at it quantitatively. (They have patents on this!)

One reason so many sites go to a harvesting business model is that once a site
is established you can make the slightest change and then your search rankings
plummet. If you depend on search engine traffic it is a huge risk that you
can't do anything about unless you are about.com (which bought a 'competitive'
search engine and just might be able to make an antitrust case against
Google).

~~~
hangonhn
Can you elaborate more on this statement? "On the other hand I think DDG is
the Sprint of search engines."

I've been interested in switching to DDG for a while, but as a former Sprint
customer, that statement scares me. Maybe some explanation from you would help
me understand your opinion better.

~~~
tracker1
I'm not sure about the comparison itself... I've tried DDG several times, I
search for technical things in generic ways a lot. DDG almost never gives me
what I want in the first page. Google almost always does.

~~~
leesalminen
Same here. It’s hard to blame DDG though: Google’s search index of Stack
Overflow is better than SO’s own.

~~~
tracker1
It's not a matter of blame at all... I'd love to see some challengers. In the
end, google knows a lot about me and is really good at delivering personalized
results because of it.

------
learc83
A company I consulted for was using a paid API to handle search.

Despite the fact that the entire site was available in an easy to scrape XML
format, scrapers kept using the search feature.

They were trying very hard to overcome my countermeasures--they had a
seemingly limitless pool of IPs, they were rotating user agent strings, and
they tried to randomize search behavior.

Every time I implemented a new countermeasure they'd try to find a way around
it. It was maddening because we made everything available for them through the
XML feed. They just wouldn't use it.

~~~
unreal37
You had a paid API, and people wanted the information for free....

Not unexpected I guess.

~~~
bo1024
After searching "algolia" mentioned below, I figured out the misunderstanding.
The company was paying _somebody else_ per search made on their web site. So
every time a scraper called the website's search function, it cost the website
money.

------
eboyjr
> we have developed a solution which removes the property from the web browser
> and thus prevents these kind of protections from figuring out that the
> browser is automated

Eli Grey and I have bypassed your "hideWebDriver()" function[1] in a single
line of code:

    
    
        if (navigator.webdriver ||
            (Navigator.prototype &&
             Object.getOwnPropertyDescriptors(Navigator.prototype)["webdriver"])) {
            // Chrome headless detected - navigator.webdriver exists or was redefined
        }
    

[1]: [https://github.com/apifytech/apify-
js/blob/262a2e604b1adb3d8...](https://github.com/apifytech/apify-
js/blob/262a2e604b1adb3d8ef96579dc1db87bc9077bb0/src/puppeteer_utils.js#L13-L51)

~~~
jardah
Good point. I haven't seen a single detection library do this, but at least
now I know that I still need to work on an alternative solution. Thanks

------
dredmorbius
Since people are asking "why would you do such a thing" or insinuating that
scraping need only be to compete somehow with Google, I'll present a use I've
found quite interesting, that _doesn't_ seek to replicate or replace Google
search, and which hasn't been readily attainable other than by scraping Google
search results, in part. The tool I've used (crude, but reasonably effective)
has applied numerous attempts to work around bot-detection, some modestly
effective. (Rate-limiting most especially.)

I've found the practice of looking at search-term frequency, across a domain
or set of domains (using the "site:<domain>" Google search filter) to be
useful, for example the "Top 100 Global Thinkers" report linked below.

It uses 100 search terms -- "global thinkers" identified by _Foreign Policy_
magazine -- searched across a set of about 100 domains and TLDs, largely
social media, various journalism (newspaper / magazine), and a few
institutional sites, as well as selected national and other top-level domains.
The result is an interesting profile of where more robust online discussion or
commentary might be found.

[https://www.reddit.com/r/dredmorbius/comments/3hp41w/trackin...](https://www.reddit.com/r/dredmorbius/comments/3hp41w/tracking_the_conversation_fp_global_100_thinkers/)

The full report requires running roughly 100 x 100, or 10,000, Google
searches. I'm finding that it's necessary to space these ~5-10 minutes apart,
which means that the full analysis takes over a month of wall-clock time, from
a single IP.
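
The core of each run is just a nested loop over terms and domains with a long
random delay between queries; roughly the following, where fetchResultCount is
a hypothetical stand-in for whatever actually issues the search and parses the
match count:

    // Sketch of the 100 x 100 run described above. fetchResultCount() is a
    // placeholder for the part that performs the Google search and extracts
    // the (often imprecise) match count.
    const terms = [/* ~100 "global thinkers" */];
    const domains = [/* ~100 domains and TLDs */];

    const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

    (async () => {
        for (const term of terms) {
            for (const domain of domains) {
                const query = `"${term}" site:${domain}`;
                const count = await fetchResultCount(query); // hypothetical helper
                console.log([domain, term, count].join('\t'));
                await sleep((5 + Math.random() * 5) * 60 * 1000); // 5-10 minutes
            }
        }
    })();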

I've considered several possible follow-ups to this study, including more or
alternate domains, different keywords, and various other variants, but both
the run-time and the coding needed to bypass bot-detection put me off this.

I've tried reaching out to Googlers I know to see if there's any possible
alternative means of acquiring this information, to no avail. I've also looked
for various research interfaces or APIs, with no joy.

DuckDuckGo and other search sites don't have the rate-limiting (I've used them
for other purposes), but also don't have the (granted, often very inaccurate /
imprecise) match-counts which Google offers.

Putting this out there both as an example _and_ a request for suggestions as
to how I might improve or modify the process.

~~~
shabble
Have you considered some sort of "crowdsourcing" / voluntary botnet type
approach?

The ArchiveTeam[1] have a simple VM image that anyone can use to schedule and
coordinate large site archival jobs that might already address some of the
issues.

Might be tricky to find people willing to provide resources, but with even a
smallish group it might work out. You may also need to guard against abuse by
running multiple queries and comparing results, which would add to the overall
request cost.

[1] [https://www.archiveteam.org/](https://www.archiveteam.org/)

~~~
dredmorbius
The thought's occurred.

My approach is sufficiently fluid that this would mean pushing pretty crude
code to a bunch of hosts frequently and on an irregular basis. The runs
themselves are fairly ad hoc.

Being able to directly query a corpus (IA, DDG, Bing, etc.) is another option.

Search across large corpora remains fairly expensive, so I can understand the
hesitancy here.

Nonstandardisation of search APIs across sites is another frustration.

------
DFHippie
I find it odd how little a basic principle enters into this: don't do
something to someone when they make it clear they don't want you to do it.

~~~
icebraining
Eh, the principle might be good (though it's not odd that not everyone shares
the same principles), but one can hold it and still have exceptions. For
example, what about a governmental institution or a public company¹? What
about a semi-public company, like a monopolist utility? What if the uploader
of the data is OK with it, but the site hoster prevents it?

¹ in the sense of owned by the State, not listed on the stock market

~~~
DFHippie
I'm not saying there aren't legitimate reasons for writing scrapers. I've
written plenty myself. It was just odd to see this disregarded entirely.

As for the commonality of principles, game theory explains most of them, so it
isn't more surprising than that we all work with the same prime numbers, say.
A simple principle of reciprocity will produce something along the lines of
"respect other people's wishes".

~~~
icebraining
_It was just odd to see this disregarded entirely._

I can't say I agree. I mean, a serious, interesting essay can certainly be
written on the ethics of scraping. But these short preludes on technical posts
just end up sounding either like a disingenuous legal disclaimer or a preachy
paternalistic tirade.

------
Asooka
Has any precedent been established on whether bypassing anti-scraping measures
does or does not violate the CFAA?

~~~
ksahin
It's a complex subject.

For example, the LinkedIn case: [https://arstechnica.com/tech-
policy/2017/08/court-rejects-li...](https://arstechnica.com/tech-
policy/2017/08/court-rejects-linkedin-claim-that-unauthorized-scraping-is-
hacking/)

Craigslist sued some companies too.

To my understanding, scraping can be legal if it's done properly, meaning not
sending too many requests at the same time, and if it does not affect the
underlying infrastructure.

It seems like in the US or in Europe, even if there is an anti-bot / anti-
scraping section in the website's TOS, public data can be scraped. Sometimes
even "private" data can be extracted using bots. For example, lots of "bank
account aggregators" have won lawsuits against banks.

~~~
Raphmedia
The issue is that if you allowed all web scraping, you could DDoS websites and
get a get-out-of-jail-free card by saying "oh, we were simply scraping some
data and it glitched out".

~~~
confounded
That sounds like a possibly-less-time-in-jail-card.

------
dvfjsdhgfv
> there are already anti-scraping solutions on the market that can detect its
> usage based on a variable it puts into the browser's window.navigator
> property. Thankfully, we have developed a solution which removes the
> property from the web browser

Does anyone know what exactly the property in question is?

~~~
zzzcpan
Probably navigator.webdriver, but there are multiple properties.

[https://antoinevastel.com/bot%20detection/2017/08/05/detect-...](https://antoinevastel.com/bot%20detection/2017/08/05/detect-
chrome-headless.html)

[https://antoinevastel.com/bot%20detection/2018/01/17/detect-...](https://antoinevastel.com/bot%20detection/2018/01/17/detect-
chrome-headless-v2.html)
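
A rough sketch of the kind of signals those posts describe (from memory; each
one alone is weak and Chrome's behaviour changes between versions, so treat
this as an illustration, not a working detector):

    // Illustrative browser-side checks along the lines of the linked posts.
    function automationSignals() {
        const signals = [];
        if (navigator.webdriver) signals.push('navigator.webdriver is set');
        if (!window.chrome) signals.push('window.chrome is missing');
        if (navigator.plugins.length === 0) signals.push('no plugins');
        if (!navigator.languages || navigator.languages.length === 0) {
            signals.push('empty navigator.languages');
        }
        return signals;
    }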

~~~
dvfjsdhgfv
That's why I was wondering. Last time I checked headless Chrome could be
pretty reliably detected in a number of ways, as you say. That they mention
just one variable seems quite odd, given that they position themselves as
specialists in the field.

~~~
jardah
The webdriver property is, as far as we know, the only one that stays
different if you use non-headless Chrome with Puppeteer. The rest can be
handled by using non-headless Chrome, as mentioned in the article.

But you are right, after reading through it again, this section of the article
should be improved.

------
matthewmacleod
On one hand, it does make a lot of sense that many web publishers want to keep
people from scraping content, given the way that it's often used nefariously,
to violate copyright, or for spam purposes.

But there are totally legitimate reasons to scrape as well. Altmetric
([https://www.altmetric.com](https://www.altmetric.com)), which is the company
I work for, tracks links to scientific research. So when someone on e.g.
Twitter links to a page on nature.com, we want to scrape the page they linked
to and figure out which paper they are talking about (if any). Academic
publishers can be particularly sensitive to scraping, making the endeavour
much more work than it needs to be.

It's a real shame that the web has moved to be so closed off in many ways.

~~~
unreal37
The web is not becoming closed off from users. It's becoming hostile to bots.
Not the same.

------
benologist
At this point you should just consider your HTML/HTTP interface an API,
because when you use headless browser technology, readily available for any
programming language, it becomes exactly that.

~~~
matheusmoreira
The HTML really is the API.

Writing a site-specific browser has always been a fun project for me. It just
pulls the information I want directly from my favorite websites. Maximum
signal-to-noise ratio and I get ad blocking for free.

People think Javascript-based sites are safer, but it's in fact even easier to
access the content because there's usually a programmatic interface available.
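
For example (hypothetical endpoint, found by watching the browser's network
tab): the frontend's own JSON API is usually easier to consume than the
rendered HTML.

    // Hypothetical example: call the JSON endpoint the site's frontend uses
    // instead of parsing the rendered page.
    (async () => {
        const response = await fetch('https://example.com/api/items?page=1', {
            headers: { Accept: 'application/json' },
        });
        const items = await response.json();
        console.log(items.map(item => item.title));
    })();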

------
systematical
Rate limiting and proxying through Tor is pretty much all this article needed
to say. Sometimes you need to fake some cookie data or get a session set
first. At least for static data.
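
A minimal sketch of that approach, assuming a local Tor daemon listening on
its default SOCKS port (9050) and the socks-proxy-agent and node-fetch npm
packages:

    // Sketch: route requests through a local Tor SOCKS proxy and rate-limit them.
    const fetch = require('node-fetch');
    const { SocksProxyAgent } = require('socks-proxy-agent');

    const agent = new SocksProxyAgent('socks5h://127.0.0.1:9050');
    const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

    async function fetchAll(urls) {
        const pages = [];
        for (const url of urls) {
            const response = await fetch(url, { agent });
            pages.push(await response.text());
            await sleep(5000 + Math.random() * 5000); // crude rate limit
        }
        return pages;
    }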

For dynamic data, sure, use Puppeteer if you have to, but my god, the
exceptions and stack traces need some work.

But most websites don't enact protections because it's generally not worth the
opportunity cost. So you really just scrape with your LOC.

If you move money or can't code, then Mozenda.

------
paulie_a
I'm glad eBay never implemented that; I wrote a scraper that hit them 5-6
billion times over a couple of years.

~~~
uptown
What was your objective?

~~~
paulie_a
To find a particular product that could be repaired and resold

------
erikrothoff
I was playing around with the idea of using Tor to get around IP blocks. I
experimented a bit with code, but the Tor binary dependency was a bit much for
my use case. Curious to know if anyone else has tried this?

~~~
always_good
Everyone else has the same idea which is why it often makes sense to block Tor
outright.

~~~
drawnwren
Yep. Tor gateways are usually included along with VPNs and AWS IPs in most
basic IP blocklists.

------
rosha
Scraping any SERP at high volume is a pain if your business relies on it;
most services out there don't work at high volume, or they do but are crazy
expensive.

I have checked a few solutions out there, and I am now using ProxyCrawl. The
developers of their API helped me get a very high volume of SERP data from
different search engines like Yandex, Google, Yahoo, and Bing. I also use them
for JavaScript crawling, as our project needs a lot of content that is
rendered via JavaScript. I am amazed at how their API endpoint works: you
basically send a URL to their API and you are good to start. Make sure to
contact them for some sites, as they do not allow you to crawl the world by
default unless you prove your use case; they liked my product and that is how
it got started. I've had a really successful experience with it, so I totally
recommend it. You basically communicate with developers who do a lot of work
to make it happen. As I am mainly in JS, I asked for a Node.js package and
they just built it, open source:
[https://github.com/proxycrawl/proxycrawl-node](https://github.com/proxycrawl/proxycrawl-node)

~~~
jorge_gonzalez
It would be interesting to know what technologies they use to scrape at high
volume for 0.005 US cents per successful request. I checked this package and
it looks decent; I like dependency-free libraries. I'll check their API for
Bing. Thanks

