
Web Scraping: Bypassing “403 Forbidden,” captchas, and more - foob
http://sangaline.com/post/advanced-web-scraping-tutorial/
======
chatmasta
Note that 99% of the time, if a web page is worth scraping, it probably has an
accompanying mobile app. It's worth downloading the app and running
mitmproxy/burp/charles on the traffic to see if it uses a private API. In my
experience, it's much easier to scrape the private mobile API than a public
website. This way you get nicely formatted JSON and often bypass rate limits.
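Once mitmproxy has revealed the endpoint, replaying the call is often just a matter of copying the app's headers. A minimal sketch with stdlib `urllib` (the endpoint, key, and headers here are hypothetical stand-ins for whatever the proxy actually shows):

```python
import json
import urllib.request

# Hypothetical endpoint and headers observed in mitmproxy; real values
# will differ per app.
API_URL = "https://api.example.com/v2/listings?page=1"
APP_HEADERS = {
    "User-Agent": "ExampleApp/4.2.0 (Android 11)",
    "Accept": "application/json",
    "X-Api-Key": "key-sniffed-from-app-traffic",
}

def build_request(url: str, headers: dict) -> urllib.request.Request:
    """Replay a private-API call with the same headers the app sends."""
    return urllib.request.Request(url, headers=headers)

req = build_request(API_URL, APP_HEADERS)
# resp = urllib.request.urlopen(req)   # uncomment against a real endpoint
# data = json.loads(resp.read())       # nicely formatted JSON, no HTML parsing
```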

~~~
jedberg
How do you deal with the issue that most mobile apps have a baked in security
key for their private API? Or am I being naive to think that most apps have
that?

~~~
dsacco
You reverse engineer the application, or you run it in a debugger.

If the app features certificate pinning to block MITM eavesdropping through
your own proxy, you either use one of the XPosed Framework libraries that
removes it on the fly in a process hook, or you decompile the app, return-void
the GetTrustedClient, GetTrustedServer, AcceptedIssuers, etc. functions.

If it features HMAC signing, you decompile the app, find the key, reverse
engineer the algorithm that chooses and sorts the parameters for the HMAC
function, and rewrite it outside the app. If the key is generated dynamically
you reverse engineer that too, and if it's retrieved from native .so files
you're going to have a fun time, but it's still quite doable.
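As a rough illustration of that last step: a recovered scheme often boils down to "sort the parameters, join them, HMAC the result with the extracted key." A minimal sketch (the key and the canonicalization rule are hypothetical; the real ones come out of the decompiled app):

```python
import hashlib
import hmac

# Hypothetical: key and parameter-ordering rule recovered from the decompiled app.
EXTRACTED_KEY = b"key-pulled-from-the-apk"

def sign_request(params: dict) -> dict:
    """Re-create the app's signature outside the app: sort the parameters,
    concatenate them, and HMAC-SHA256 the result with the baked-in key."""
    canonical = "&".join(f"{k}={params[k]}" for k in sorted(params))
    sig = hmac.new(EXTRACTED_KEY, canonical.encode(), hashlib.sha256).hexdigest()
    return {**params, "sig": sig}

signed = sign_request({"item_id": "123", "ts": "1700000000"})
# `signed` can now be sent exactly as the app would send it
```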

All they can do is pile on layers and layers of abstraction to make it
painful. They can't make the private API truly private if it requires
something shipped with the client.

~~~
sametmax
The initial idea was to make your life simpler by parsing JSON instead of
HTML. Now we are decompiling binaries. Somewhere on the way, we got lost.

~~~
diminoten
Once you do the one-time work of pulling out the key, you can just add
something like, "secret_key=foobar" to your requests, and you're back to
happily parsing JSON.

If they keep changing it up, I'm sure you could automate the decompiling
process. The reality is that this technique is security by obscurity at its
core, and is therefore never going to succeed.
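In the simplest case that really is the whole integration; a sketch with a hypothetical endpoint and the extracted key:

```python
from urllib.parse import urlencode

# Hypothetical key pulled once from the decompiled app.
SECRET_KEY = "foobar"

def with_secret(url: str, params: dict) -> str:
    """Tack the app's baked-in key onto an otherwise ordinary API request."""
    return f"{url}?{urlencode({**params, 'secret_key': SECRET_KEY})}"

url = with_secret("https://api.example.com/search", {"q": "widgets"})
```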

~~~
charlesdm
Skype is probably one example where it took developers 10+ years to figure out
how the app worked.

~~~
JustSomeNobody
Did it take that long to do it or did it take that long for someone to care to
do it? I mean, it's Skype.

------
thefifthsetpin
Better solution: pay target-site.com to start building an API for you.

Pros:

* You'll be working with them rather than against them.

* Your solution will be far more robust.

* It'll be way cheaper, supposing you account for the ongoing maintenance costs of your fragile scraper.

* You're eliminating the possibility that you'll have to deal with legal antagonism

* Good anti-scraper defenses are far more sophisticated than what the author dealt with. As a trivial example: he didn't even verify that he was following links that would be visible!

Cons:

* Possible that target-site.com's owners will tell you to get lost, or they are simply unreachable.

* Your competitors will gain access to the API methods you funded, perhaps giving them insight into why that data is valuable.

Alternative better solution for small one-off data collection needs: contract
a low-income person to just manually download the data you need with a normal
web browser. Provide a JS bookmarklet to speed their process if the data set
is a bit too big for that.

~~~
pjc50
> pay target-site.com to start building an API for you.

When has that ever worked?

~~~
benjamincburns
I can't cite specific examples because the ones I know about formed
confidential business relationships, but I can say with confidence that this
works All. The. Time.

That said, if you're some small-time researcher who can't offer a compelling
business case to make this happen, then it won't be worth their time and
they're likely to show you the door. [Note: my implication here is that it's
not because you're small time, but it's because by the nature of your work
you're not focusing on business drivers which are meaningful to the
company/org you're propositioning].

Edit: Also be warned that if you're building a successful business on scraped
personal info, you're _begging_ to be served w/ a class action lawsuit (though
take that well-salted, because IANAL and all that jazz).

------
nip
Scrapy is indeed excellent. One feature that I really like is Scrapy Shell
[1].

It lets you run and debug the scraping code without running the spider, right
from the CLI.

I use it extensively to test that my selectors (both CSS and XPATH) are
returning the proper data on a test URL.

[1]
[https://doc.scrapy.org/en/latest/topics/shell.html](https://doc.scrapy.org/en/latest/topics/shell.html)
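A typical session (the URL and selectors are hypothetical) looks something like:

```
$ scrapy shell "https://example.com/products"
...
>>> response.css("h1.title::text").extract_first()
'Example product'
>>> response.xpath("//a[@rel='next']/@href").extract_first()
'/products?page=2'
>>> fetch("https://example.com/products?page=2")   # load another page in place
```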

~~~
brilliantcode
A few things turn me off about Scrapy. It feels over-engineered for what it
does. Why do I need an entire framework?

I'm taking on technical debt to access data I don't have programmatic access
to.

CSS/Xpath are very fragile. You most likely will be changing them in the
future.

~~~
eli
> _CSS /Xpath are very fragile. You most likely will be changing them in the
> future._

Genuinely curious what the alternative is

~~~
brilliantcode
I've been doing research on this, but it's not clear whether this problem is a
pain for enough businesses to justify further investment.

I often feel like web scraping is treated as a commodity by people who don't
understand any of the inherent technological complexities and challenges.

Very discouraging field to be in, especially when people claim to have pain
but are unwilling to pay very much for it or show appreciation for the effort
that goes into it.

edit: thanks for the downvotes. perfect illustration of how innovation is
punished and unrewarded in this field.

~~~
jlgaddis
FYI, I only downvoted you after you complained about downvotes.

~~~
brokenmachine
I only downvoted because there was no alternative offered, just complaining
about how underappreciated scraper creators are.

 _> perfect illustration of how innovation is punished_

I see no innovation in your post.

The complaining about downvotes was just the icing on the cake, cementing the
downvote.

------
jlgaddis
Here's an idea (although probably an unpopular one around here): if a site is
responding to your scraping attempts with 403s -- a.k.a. "Forbidden" -- stop
what you're doing and go away.

~~~
skinnymuch
This is a very obvious thing to say. Perhaps it needed to be said, I don't
know -- it's just a very obvious counter.

~~~
jlgaddis
Yeah, it _seemed_ obvious... but judging by all of the comments here on how to
"bypass" 403s, it actually wasn't obvious at all.

~~~
skinnymuch
I meant it is obvious in that everyone knows that. But they'll still want to
bypass it. So everyone is completely aware of it.

------
superasn
The web scraping tool of my choice still has to be WWW::Mechanize for Perl.

P.S. I wrote a WWW::Mechanize::Query extension for it that adds support for
CSS selectors etc., if anyone is interested. It's on CPAN.

~~~
3pt14159
Same in Ruby land. That with inspector gadget and you're golden.

------
Lxr
I have done a lot of scraping in Python with requests and lxml and never
really understood what scrapy offers beyond that. What are the main features
that can't be easily implemented manually?

~~~
makmanalp
Pluggable parsers, good error handling out of the box, spidering
functionality (finding and queueing new links to scrape), great logging,
progress stats, exports, pause/resume functionality, and a million other
goodies that are seemingly "trivial" but really you don't want to rewrite them
every time you write a scraper.

edit: Especially if your scraping jobs take a LONG time - days and weeks, this
stuff is extra handy. Might I add a great debugging environment (scrapy
shell), error handling, rate limiting, respecting robots.txt, so much more.

~~~
nostrademons
How much benefit does the spidering/progress/pause/resume functionality give
if you're not just spidering every link on the site, but have complex logic to
determine exactly which links to crawl and in what order? Does Scrapy provide
convenient extension hooks to change the crawl algorithm?

~~~
un-devmox
I haven't had the need to use pause/resume but I do incorporate logic (not
necessarily all that complex) in determining which links to crawl. It is very
easy to do within each spider especially with how the framework uses
generators. It is also easy to extend the pipelines for pre and post
processing.

As others have said, managing a project with Scrapy is super easy and highly
configurable with sane out of the box settings.
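A Scrapy-independent sketch of that point: because a parse callback is just a generator, "which links to follow" is ordinary Python between the yields (the selection rule below is hypothetical):

```python
from urllib.parse import urljoin

# A parse callback is just a generator, so custom link-selection logic is
# plain Python: filter, reorder, or score the links before yielding them.
def parse(base_url: str, links: list):
    for href in links:
        if "/product/" in href:        # hypothetical selection rule
            yield urljoin(base_url, href)

out = list(parse("https://example.com/", ["/product/2", "/about", "/product/10"]))
```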

------
foxylion
I'm curious what others use to scrape modern (javascript based) web
applications.

The old web (HTML and links) works fine with tools like Scrapy, but for modern
applications that rely on javascript this no longer works.

For my last project I used a Chrome plugin which controlled the browser's URL
location and clicks. Results were transmitted to a backend server. New jobs
(clicks, URL changes) were retrieved from the server.

This worked fine but required some effort to implement. Is there an open
source solution which is as helpful as Scrapy but solves the issues posed by
modern javascript websites/applications?

With tools like Chrome headless this should now be possible, right?

~~~
jakubbalada
Disclaimer: I'm a co-founder of Apifier [1].

It's not open source, but it's free up to 10k pages per month. And it can
handle modern JS web applications (your code runs in the context of the
crawled page). You can, for example, scrape an API key first and then use
internal AJAX calls.

There's also a community page [2] where you can find and use crawlers made by
other users.

[1] [https://www.apifier.com](https://www.apifier.com) [2]
[https://www.apifier.com/community/crawlers](https://www.apifier.com/community/crawlers)

~~~
brilliantcode
interesting. are you seeing any product/market fit for this?

~~~
jakubbalada
We see a lot of users who need data from the web, or APIs for sites that
don't have one. Just not all of them can code, and we have to scale custom
development.

~~~
brilliantcode
Are these developers? Business people? I'm curious because we've been
searching for a tool like this for a while but ultimately management thought
it was a bad idea to rely on scraping, there's simply no replacement for a
REST api.

~~~
jakubbalada
Both - developers on the free plan building their own RSS feeds for sites
without one, and business people (mainly startups) building their products on
top of Apifier.

Typical use is an aggregator that needs a common API for all partners who are
not able to provide it. So they have a running API on Apifier in an hour. It
might break once in a while - then you have to update your crawler (not that
often if you use internal AJAX calls).

~~~
brilliantcode
I see, so there's not much value beyond startups and bootstrappers.

I feel like it's a hard sell to enterprises. Scraping is viewed as inferior to
an API, so it makes sense for enterprises to just pay the target website for
access to the data.

~~~
jakubbalada
It's also hard to get direct access to the data.

But you're right, it's a hard sell to enterprises, although we have some (e.g.
a real estate developer creating pricing maps).

------
m00dy
I use Greasemonkey on Firefox. Recently, I wrote a crawler for a major
accommodation listing website in Copenhagen. Guess what? I got a place to live
right in the center in 2 weeks. I love SCRAPERS I love CRAWLERS.

~~~
ben_jones
Well the problem is when someone scrapes ALL the good listings then
pre-purchases them for resale at double the cost.

~~~
Raphmedia
How is it different than paying 50+ low-wage remote workers to "scrape" the
phonebook for you and then using the information acquired for profit?

~~~
greglindahl
One difference is that Feist v. Rural Telephone says that the data in a
phonebook can't be copyrighted.

[https://en.wikipedia.org/wiki/Feist_Publications,_Inc.,_v._R...](https://en.wikipedia.org/wiki/Feist_Publications,_Inc.,_v._Rural_Telephone_Service_Co.)

~~~
Raphmedia
What about using those employees to "crawl" the web for you then?

~~~
greglindahl
I suspect it's roughly the same as a crawler -- same issues of fair use,
TOS/CFAA, etc. -- but likely there's no expectation that humans will read and
follow robots.txt.

------
janci
I use Java with a simple task queue and multiple worker threads (Scrapy is
single-threaded, although it uses async I/O). Failed tasks are collected into
a second queue and restarted when needed. I used Jsoup [1] for parsing, and
proxychains and HAProxy + Tor [2] for distributing across multiple IPs.

[1] [https://jsoup.org/](https://jsoup.org/) [2]
[https://github.com/mattes/rotating-proxy](https://github.com/mattes/rotating-proxy)

~~~
janci
Hardest part was synchronization:

- to end the main thread only if all tasks are done

- when every running task can produce multiple new tasks

- with limiting the maximum number of running threads

- always running the maximum number of threads if possible

Semaphores to the rescue.
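For comparison, those constraints map almost one-to-one onto `queue.Queue` in Python, whose `task_done()`/`join()` pair replaces the hand-rolled semaphores; a sketch with a fake `crawl` function standing in for real fetching:

```python
import queue
import threading

MAX_WORKERS = 4  # hard cap on concurrent worker threads

def crawl(url):
    """Hypothetical task: pretend each shallow page links to two more."""
    return [url + "/a", url + "/b"] if url.count("/") < 4 else []

tasks = queue.Queue()
results = []
lock = threading.Lock()

def worker():
    while True:
        url = tasks.get()              # blocks until work is available
        try:
            found = crawl(url)
            with lock:
                results.append(url)
            for u in found:            # a running task can enqueue new tasks
                tasks.put(u)
        finally:
            tasks.task_done()          # lets join() see when everything is done

for _ in range(MAX_WORKERS):
    threading.Thread(target=worker, daemon=True).start()

tasks.put("http://example.com")
tasks.join()                           # main thread ends only when all tasks are done
```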

~~~
tokenizerrr
Doesn't ThreadPoolExecutor take care of all of that if you store the returned
Future from the submit method? Then you just have the main thread wait for
those.

------
jacquesm
Note that in some places this constitutes breaking the law.

~~~
beejiu
How is this any different from Google doing it? It is okay for Google to crawl
the Internet, but not okay to crawl Google Play? Google raising such an
objection would be an ultimate irony.

Edit: On second thought, I guess you are referring to overcoming 403s and
Captchas?

~~~
cmdrfred
Unauthorized access: if you access the service in violation of their TOS, then
potentially they have a case against you. I'm not aware of it ever going to
court in a case where they didn't also send a cease and desist.

~~~
cryptarch
I'm afraid that as long as explicit agreement is not required to make a TOS
binding we'll be dealing with this crap.

~~~
cookiecaper
Yes, but see _QVC v. Resultly_ , where the robots.txt was considered binding,
not the human-readable TOS.

We are getting small wins, but it's going to be slow going until we can get
Congress to adjust both the CFAA and the Copyright Act, or until we can get
SCOTUS to seriously alter the way these acts have been interpreted with
reference to internet access.

~~~
brilliantcode
Not quite. [https://www.law360.com/articles/757906/qvc-website-crash-
sui...](https://www.law360.com/articles/757906/qvc-website-crash-suit-against-
resultly-gets-trimmed)

Basically, all damage claims are null because QVC & Resultly never entered
into a mutual agreement. You can write whatever the fuck you want in your ToS,
but it's not legally binding.

> Judge Beetlestone also rejected QVC’s claims that Resultly violated the
> Computer Fraud and Abuse Act by knowingly and intentionally harming the
> retailer when Resultly caused the shopping network's website to crash,
> reasoning that the tech company and QVC both could only earn money if the
> site was operational.

I see you are back on the FUD train surrounding web scraping, but there's only
one very specific case where your fears materialize: "when you receive a C&D
from said website, do not continue scraping". Such was the case for Craigslist
vs 3Taps.

Please do not cite legal resources and grossly twist realities to spread FUD.
If you don't want to be web scraped, simply do not put it online.

~~~
cookiecaper
>Basically, all damage claims are null because QVC & Resultly never entered
into a mutual agreement. You can write whatever the fuck you want in your ToS,
but it's not legally binding.

I'm pretty sure that's what I said re: the ToS? That's only _one_ element of
the case (breach of contract). You are correct that in this case, browsewrap
was not considered applicable. There have been a few other cases where it
wasn't too, as in _Nguyen v. Barnes & Noble, Inc._, but there have been cases
where it _was_ , as in _Hubbert v. Dell Corp._. Also note that most cases re:
browsewrap do not challenge the viability of automatically entering an
agreement by clicking around the site, but rather argue that the notification
was simply not prominent enough. It could be worked around by moving the
notice into a more prominent location on the page.

The other element is CFAA, and referring to the wide-open robots.txt helped
Resultly establish that they were attempting to act in good faith and were not
maliciously damaging QVC's systems and not exceeding authorized access to the
computer system.

>Please do not cite legal resources and grossly twist realities to spread FUD.
If you don't want to be web scraped, simply do not put it online.

I'm not trying to spread FUD, I'm just trying to make it clear that the legal
situation is precarious. Google has clearly shown that if you are able to
build your coffers and reputation faster than you can incur lawsuits, you can
win on this. In fact, lots of big companies begin that way, and become big
companies merely because they were lucky enough to get big enough to stand up
for themselves before the legal threats started coming in the door.

It's understood that you'll be scraped if you put it online. That doesn't mean
scraping is legal.

You may be confused here -- I'm not a publisher trying to stop people from
doing this. I'm an entrepreneur whose business depended on scraping data from
a specific source. That business got destroyed when they chose to dispatch
their law firm against us.

The point of repeatedly discussing this on HN is to make the legal situation
clear so that people work to change it, and to make sure people who are going
into similar ventures are informed about the legal risks associated with them.

As I said on another post, I am not a lawyer, and this is according to my
layman's understanding. No one should misinterpret my posts as legal advice.
I'm not going to copy and paste this disclaimer into every post I make because
it should be implicit, and a few IANAL disclaimers is plenty.

~~~
brilliantcode
Well, the difference is pretty clear between your case and the rest.
Developers scraping a website isn't going anywhere. A business reliant on
scraped data is making money off of it. That will land you in a precarious
situation more often.

My criticism was that you mixed in service providers and tool providers that
enable businesses to make money off scraped data - the vendor cannot be held
responsible for misbehaving clients; the best it can do is cut them off when
requested by external parties. Toyota doesn't appear as a witness in vehicular
manslaughter cases. A car takes you from A to B, but it's not Toyota's fault
if the customer runs over something between those two points - that's not its
intended design (QVC vs Resultly).

It also doesn't help that there are pathological web scrapers who simply do
not have the money to do anything fruitful, so they will bootstrap using any
means necessary and play the victim card when they are denied. This particular
group is responsible for the majority of the litigation: people who otherwise
have no business, piggybacking off somebody else using brute force, bringing
heat to everyone involved.

~~~
cookiecaper
>Well, the difference is pretty clear between your case and the rest.
Developers scraping a website isn't going anywhere. A business reliant on
scraped data is making money off of it. That will land you in a precarious
situation more often.

Developers presumably scrape websites because the data is of some value to
them, frequently commercial value. Google's entire value proposition is based
on scraped data, and it's one of the most valuable companies on the planet.
The way the data is used is not necessarily relevant to whether the act of
scraping a web page violates the law or not -- several more basic hurdles
involving access, like the CFAA and potential breach of contract depending on
whether the facts of the case are such that the court holds the ToS
enforceable, have to be overcome before the matter of whether one is entitled
to utilize the data obtained becomes the hinge.

>My criticism was that you mixed in service providers and tool providers that
enable businesses to make money off scraped data - the vendor cannot be held
responsible for misbehaving clients; the best it can do is cut them off when
requested by external parties.

 _3Taps_ is one of the most prominent such cases and it was just the type of
tool that you're claiming wouldn't be held accountable. 3Taps's actual client
was PadMapper, but since 3Taps was the entity actually performing the scrape,
they were the party that was liable for these activities.

The lesson we've learned from 3Taps is that scraping tools _might_ be OK if
they strictly observe any hint that a target site doesn't want the attention
and cease immediately, but there's really no guarantee either way.

Most people won't sue if you adhere to a C&D, not because they _couldn't_ do
so and win, but because it's much cheaper to send a C&D and leave it at that,
as long as that settles the issue moving forward. Litigation is very slow and
expensive.

~~~
brilliantcode
3Taps became liable because they put their neck out for PadMapper even after
they received written letters.

It was a poorly executed business strategy because they were up against a
powerful legal team.

If you receive a C&D or a request to stop scraping, it's best to not continue
and just let that customer go.

~~~
cookiecaper
You _can_ be sued (and lose) for damages incurred by illegal activity whether
the aggrieved party sends a notice or not. It's not the plaintiff's job to let
you know you're breaking the law, and they're entitled to damages whether you
know you're breaking the law or not.

In fact, it's _assumed_ that defendants weren't intentionally breaking the
law, which is why when it's clear that they _were_ , courts triple the actual
damages for willful violations. [0]

If a reasonable person wouldn't realize that they were "exceeding authorized
access", that probably limits a potential CFAA claim, but that's it, and
that's not the only potentially perilous statute when you're a scraper. In the
QVC case, Resultly got lucky that QVC did not have an up-to-date robots.txt;
otherwise, they very well may have been on the hook for multiple days of lost
online revenue, despite their immediate cessation upon receipt of a C&D.

Again, you are more than welcome to take your perspective and run with it, and
it's plausible that no one will get mad enough at you to sue over it. That
doesn't change the law.

I would assume that 3Taps pursued this litigation not because they had special
love for PadMapper, but because they felt it was important for their business
to be allowed to scrape major data sources and thought they'd be able to win.
Pretty sure Skadden was their law firm so they gave it an earnest try, but
ultimately lost.

[0]
[https://en.wikipedia.org/wiki/Treble_damages](https://en.wikipedia.org/wiki/Treble_damages)

~~~
brilliantcode
You can be sued for crossing the street. You can be sued for flipping the bird
if someone happens to get an aneurysm from it. You can be sued for writing
what you just wrote!

~~~
cookiecaper
Yes, but if you fight it adequately, you won't lose. If you get sued for
scraping, it's quite likely you'll lose, as the law has numerous pitfalls for
scrapers, including things as basic as regarding RAM copies as infringing.

------
ivanhoe
We all do this, but how legal is it? If people end up in prison for pen
testing without permission, how safe is it to intentionally alter the
user agent and circumvent captchas, javascript and other protections? Can that
be considered hacking a site and stealing the data?

------
piker
Proposition: 99% of scraping use cases are eliminated if the scraper agrees to
subsequently abide by the target's terms of service.

~~~
jazoom
Googlebot doesn't abide by 99% of websites' terms of use.

~~~
simplyluke
But that's "different" because they've built a $600bn company off it.

~~~
jazoom
More that the websites actually want to be found by someone.

~~~
camus2
The over-reliance on google search is both a blessing and a curse for the web.
Today, google IS basically the web, a centralized version of it.

------
herbst
I've used Antigate for captchas and either Tor or proxies for 403s before.
Usually changing the browser header alone does not help for long.

~~~
tehlike
Anti-Captcha and DeathByCaptcha are some others. But it makes me feel sad to
use them, as it exploits cheap labor overseas.

~~~
homakov
Most of the time they use OCR, humans are unreliable and rarely used.

~~~
herbst
No, at least Antigate doesn't. When you hit reCAPTCHA from known proxies (or
generally hit it a few times per hour), the captchas get so bad that no OCR
would be able to solve them; even humans struggle.

~~~
tehlike
Yup, exactly. I tried Tesseract before (nothing too fancy); it didn't have
problems solving it, but at some point it became really hard.

I think part of it is how you crawl (PhantomJS, for example, seems to hit the
captcha almost every time), but things like IP and proxy usage could make this
trigger more often.

------
jordif
Good article! I've been doing scraping for the last 10 years and I've seen a
lot of different things sites try to stop us. Also, I'm on the other side,
protecting websites by banning scrapers - so funny!

~~~
skinnymuch
I'm in the same position for the first time (protecting against scraping) and
honestly I'm kind of blind right now. Which is weird because of how much
scraping I've done (okay not that much). Any tips or tricks or blogs you know
of off the top of your head for protecting your site?

~~~
corford
Virtually everything can be easily defeated. The only outfit I've consistently
seen put up a good fight is Distil. They do it by acting a little like
Cloudflare. They put their servers in front of your www facing endpoints and
use ML to mine their global client traffic to identify bot signals (aided by
some aggressive in-browser javascript fingerprinting).

~~~
kbenson
Yeah, Distil is the first outfit I've encountered where they've got the model
to make it really hard to _reliably_ bypass. It comes down to "I can spend a
significant amount of time trying to bypass this, and I would, but they would
likely identify and block me again within a few weeks at most.", and it's not
worth it when it's only _part_ of what I need to do to scrape some data, and
it's their entire job, and they can afford to hire multiple people.

The economics are in their favor, and I make it a point not to fight economics
when I recognize them, it's rarely sustainable.

~~~
skinnymuch
Distil is really interesting.

------
fiatjaf
What if the target page is blocking by IP address, and even with 20 different
IP addresses you wouldn't be able to fetch all the data you need in a month?

~~~
corford
Professional proxy services. Price, IP pool size and quality vary hugely but
if you're not trying to scrape an aggressively defended target and don't need
to make more than a handful of requests per second, 100K IPs will usually be
more than enough to circumvent most rate limits and a pool that size can be
rented for under $100/month.
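Mechanically, rotating such a pool can be as simple as a round-robin over the rented addresses; a sketch with a hypothetical three-address pool:

```python
from itertools import cycle

# Hypothetical rented pool; a real one would be thousands of entries.
PROXIES = ["http://10.0.0.1:8080", "http://10.0.0.2:8080", "http://10.0.0.3:8080"]
rotation = cycle(PROXIES)

def proxy_for_next_request() -> dict:
    """Round-robin the pool so each request leaves from a different IP."""
    p = next(rotation)
    return {"http": p, "https": p}   # the shape used by e.g. requests' proxies= arg

picked = [proxy_for_next_request()["http"] for _ in range(4)]
```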

~~~
gbrits
Interested to know where to get 100k proxies for $100/mo. Can you give some
options?

------
ic3cold
Have you seen the Sentry anti-robot system? I can't remember the name exactly,
but it's a hosted solution that randomly displays captchas when it senses
suspicious (robot) crawling. It's a nightmare, because after you solve 1
captcha it can display 4 more, one after the other. They also ban your IP, so
you need IP rotators. Any workarounds?

------
ge96
What if they use that before:after thing where the content takes, say, a
couple of seconds to appear, so when you try to scrape the site it appears
that nothing is there? I have only used the HTMLSimpleDom scraper with PHP at
this point.

------
mirimir
Sometimes it's also necessary to spread requests over numerous IP addresses.

------
dmn001
The first part seems like a very long-winded way to say "don't use the default
user agent".

The captcha was unusually simple to solve; in most cases the best strategy is
to avoid seeing it in the first place.
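For anyone who wants the short version in code: with stdlib `urllib`, swapping out the default `Python-urllib` User-Agent is the whole fix (the UA string below is just an example browser signature):

```python
import urllib.request

# Python's urllib announces itself as "Python-urllib/3.x" by default,
# which many sites 403 on sight; a browser-like UA avoids that check.
BROWSER_UA = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
              "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0 Safari/537.36")

req = urllib.request.Request(
    "https://example.com/",
    headers={"User-Agent": BROWSER_UA},
)
# urllib.request.urlopen(req)  # now served like a normal browser visit
```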

------
eapen
Enjoyed learning this and playing with it. What would you recommend storing
this sort of data in? Not too keen on going with the traditional MySQL.

------
bla2
Nice overview! The "unfortunately-spelled threat_defence.php" just uses
British spelling though.

~~~
richthegeek
What's wrong with British spelling? It's also the English spelling used in
India, Australia, New Zealand, etc. By pure numbers, more people may spell it
defence than defense. Americocentrism is quite annoying from the other side :)

~~~
bla2
I'm not saying anything's wrong with it. The "unfortunately named" bit is from
the article, and I'm just pointing out that the author's snark is ill-placed.

------
ouid
too bad it's named for ovine prions.

------
known
Try lynx

------
Exuma
Great article!

