
Web Scraping in 2016 - franciskim
https://franciskim.co/2016/08/24/dont-need-no-stinking-api-web-scraping-2016-beyond/
======
minimaxir
Keep in mind that companies have _sued_ for scraping not through the API, for
example LinkedIn, which explicitly prevents scraping via the ToS:
[http://www.informationweek.com/software/social/linkedin-sues...](http://www.informationweek.com/software/social/linkedin-sues-after-scraping-of-user-data/d/d-id/1113362)

OKCupid did a DMCA takedown for researchers releasing scraped data:
[https://www.engadget.com/2016/05/17/publicly-released-okcupi...](https://www.engadget.com/2016/05/17/publicly-released-okcupid-profiles-taken-down-dmca-claim/)

Since both of those incidents, I now only scrape a) through the API, following its rate limits, or b) if there is no API and the data has the explicit purpose of being shared publicly (e.g. blogs), in which case I follow robots.txt. Of course, most companies have a do-not-scrape clause in their ToS anyway, to my personal frustration.

(Disclosure: I have developed a Facebook Page Post Scraper [[https://github.com/minimaxir/facebook-page-post-scraper](https://github.com/minimaxir/facebook-page-post-scraper)] which explicitly follows the permissions set by the Facebook API.)

~~~
hobs
This post is kind of crazy, aggrandizing bad behavior and the misuse of others' resources against their will.

Scraping against the ToS is super bad netizen stuff, and I don't think people should be posting positive reviews of people doing this. Breaking CAPTCHAs and the like is basically black-hat work and should be looked down upon, not congratulated as I see in this thread.

~~~
madamelic
>Scraping against the ToS is super bad netizen stuff, and I don't think people should be posting positive reviews of people doing this. Breaking CAPTCHAs and the like is basically black-hat work and should be looked down upon, not congratulated as I see in this thread.

Not really.

Scraping, in my opinion, isn't black hat unless you are actually affecting
their service or stealing info.

If you are slamming the site with requests because of your scraping, yeah you
need to knock it off. If you throttle your scraper in proportion to the size
of their site, you aren't really harming them.

As for "stealing info": as long as you aren't taking info and selling it as your own (which it seems OP is indeed doing), that is just fine.

tl;dr: Scraping isn't bad / black hat as long as you aren't affecting their service or business.

~~~
muglug
> If you throttle your scraper in proportion to the size of their site, you
> aren't really harming them.

And do you understand their site infrastructure to know whether you're doing
harm? It's perfectly possible that your script somehow bypasses safeguards
they had in place to deal with heavy usage, and now their database is locking
unnecessarily.

~~~
cookiecaper
Eh, this is pretty weak. Scrapers are no different from other browsing
devices. The web speaks HTTP. There's no reason that using another HTTP
browser would cause any disparate impact just by virtue of not being a
conventional desktop browser -- you've thrown out a pretty absurd
hypothetical. In fact, scrapers usually cause _less_ impact because they
usually don't download images or execute JavaScript.

I did an analysis and a session browsed with my specialized browser would
always consume less than 100K of bandwidth (and often far less), whereas a
session browsed with a conventional desktop browser would consume _at least_
1.2 MB, even if everything was cached, and sometimes up to 5 MB. In addition,
on the desktop, a JavaScript heartbeat was sent back every few seconds, so all
of that data was conserved too.

Because we were a specialized browser used by people looking for a very
specific piece of data, we could employ caching mechanisms that meant that
each person could get their request fulfilled without having to hit the data
source's servers. We also had a regular pacing algorithm that meant our users
were contacting the site way less than they would've been if they were using a
conventional desktop browser.

Our service saved the data source a _large_ amount of resource cost. When we
were shut down, their site struggled for about two weeks to return to
stability. I think they had anticipated the opposite effect.

Our service also saved our users a large amount of time. We were accessing
publicly-available factual data that was not copyrightable (but only available
from this one source's site). There's no reason that the user should be able
to choose between Firefox and Chrome but not a task-specialized browser.

It is true that some people will (usually accidentally) cause a DDoS with
scrapers because the target site is not properly configured, but the same
thing could be done with desktop browsers. It doesn't mean that scrapers
should be disadvantaged.

~~~
hchasestevens
A small counterpoint to this -- in the airline industry, it's relatively
commonplace for seat reservations to be made for a user _before_ payment has
occurred. In this case, if you're mirroring normal browser activity, you can
(temporarily) reduce availability on a flight, potentially even bumping up the
price for other, legitimate users, and almost certainly causing the airline to
incur costs beyond normal bandwidth and server costs. I'm sure there are many
other domains for which this is also the case, however rare.

~~~
gnud
If they don't do the seat reservation behind a POST, or at least blacklist the
reservation page in robots.txt, I have no sympathy.

------
mack73
Corporations will abuse your personal integrity whenever they get a chance, while abiding by the law. Corporations will cry like babies when their publicly available data (their livelihood) gets scraped. They will take you to court.

They consider their data to be theirs, even though they published it on the
internet. They consider your data (your personal integrity) to be theirs as
well, because how can you assume personal integrity when you are surfing the
internet?

I have high hopes that the judicial system, some time not too far from now, will realize that since the law should be a reflection of current moral standards, it will always lag behind, trying to catch up with us; and that those who break the law without breaking current moral standards are still "good citizens" unworthy of prison or fines.

I guess Google won this iteration of the internet because of the double standard site owners hold: allowing Google to scrape anything while hindering any competitor from doing the same. There will only be a true competitor to Google when, in the next iteration of the internet, we realize that searching vast amounts of data (the internet) is a solved problem, that anyone can do as good a job as Google, and we move on to the next quirk, around which there will be competition; in the end that quirk will be solved, we'll have a winner, and it will be time to move on to the next iteration.

~~~
dspillett
> Corporations will abuse your personal integrity whenever they get a chance,
> while abiding the law.

Call me cynical if you will, but I'd leave "while abiding by the law" out of that, or at least replace it with "while hoping they aren't breaking the law". Due diligence on these matters is often sadly lacking. They'll take the information first and only consider any such implications when/if they come up later.

Large organisations like Google probably will make the up-front effort to remain legal, because they are in the public eye enough that failing to do so would attract a lot of unwanted press, but you don't have to get much smaller than that before you start finding companies that are a lot less careful (or in some cases wilfully negligent).

~~~
cm2187
I would use Microsoft as a precedent. Sure, they will attempt to stay legal, but by pushing it as far as they can.

For instance, the browser-choice screen that came with Windows, imposed by the EU, never worked. It was a "bug". Somehow they must have omitted to test the feature...

Over the last year or so Microsoft has started playing nice, and I think Google and Facebook have become the new corporate villains. But recently the Windows team seems minded to challenge them for that position.

------
fake-name
I do a significant amount of scraping for hobby projects, albeit mostly of open websites. As a result, I've gotten pretty good at circumventing rate limiting and most other controls.

I suspect I'm one of those bad people your parents tell you to avoid - by that
I mean I completely ignore robots.txt.

At this point, my architecture has settled on a distributed RPC system with a rotating swarm of clients. I use RabbitMQ for message-passing middleware, SaltStack for automated VM provisioning, and Python everywhere for everything else. Using some randomization and a list of the top n user agents, I can randomly generate about ~800K unique but valid-looking UAs. Selenium+PhantomJS gets you through non-CAPTCHA Cloudflare. Backing storage is Postgres.
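
A minimal sketch of how that kind of UA generation can work; the templates and version ranges below are invented for illustration, not fake-name's actual list:

    // Sketch: splice random but plausible version numbers into UA
    // templates taken from a top-user-agents list. A handful of
    // templates times a few version fields yields huge numbers of
    // unique but valid-looking strings.
    const templates = [
      'Mozilla/5.0 (Windows NT {nt}; Win64; x64) AppleWebKit/537.36 ' +
        '(KHTML, like Gecko) Chrome/{cr}.0.{build}.{patch} Safari/537.36',
      'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.{osx}; rv:{ff}.0) ' +
        'Gecko/20100101 Firefox/{ff}.0',
    ];

    const pick = (arr) => arr[Math.floor(Math.random() * arr.length)];
    const rand = (lo, hi) => lo + Math.floor(Math.random() * (hi - lo + 1));

    function randomUA() {
      const ff = rand(43, 48); // rv: and Firefox/ versions must match
      return pick(templates)
        .replace('{nt}', pick(['6.1', '6.3', '10.0']))
        .replace('{cr}', rand(49, 52))
        .replace('{build}', rand(2000, 2900))
        .replace('{patch}', rand(0, 140))
        .replace('{osx}', rand(9, 11))
        .replace(/\{ff\}/g, ff);
    }

    console.log(randomUA());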

Database triggers do row versioning, and I wind up with what is basically a
mini internet-archive of my own, with periodic snapshots of a site over time.
Additionally, I have a readability-like processing layer that re-writes the
page content in hopes of making the resulting layout actually pleasant to read
on, with pluggable rulesets that determine page element decomposition.

At this point, I have a system that is, as far as I can tell, definitionally a botnet. The only thing is, I actually pay for the hosts.

---

Scaling something like this up to high volume is a really interesting challenge. My hosts are physically distributed, and just maintaining the RabbitMQ socket links is hard. I've actually had to do some hacking on the RabbitMQ library to let it handle the various ways I've seen a socket get wedged, and I still have some reliability issues in the SaltStack-DigitalOcean interface, where VM creation gets stuck in an infinite loop, leading to me bleeding all my hosts. I also had to implement my own message fragmentation on top of RabbitMQ, because literally no AMQP library I found could _reliably_ handle large (>100K) messages without eventually wedging.
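
A rough sketch of what application-level fragmentation over AMQP can look like, assuming the amqplib client; the chunk size and header fields here are invented (the real implementation lives in the repos linked below):

    // Sender: split a large payload into chunks tagged with an id,
    // a sequence number, and a total count.
    const crypto = require('crypto');
    const CHUNK = 64 * 1024; // stay well below the sizes that wedge

    function sendFragmented(channel, queue, buf) {
      const id = crypto.randomBytes(8).toString('hex');
      const total = Math.ceil(buf.length / CHUNK);
      for (let seq = 0; seq < total; seq++) {
        channel.sendToQueue(queue, buf.slice(seq * CHUNK, (seq + 1) * CHUNK), {
          headers: { id, seq, total },
        });
      }
    }

    // Receiver: buffer fragments per id and reassemble when complete.
    const partial = new Map();

    function onFragment(msg, deliver) {
      const { id, seq, total } = msg.properties.headers;
      const parts = partial.get(id) || new Array(total).fill(null);
      parts[seq] = msg.content;
      partial.set(id, parts);
      if (parts.every(Boolean)) {
        partial.delete(id);
        deliver(Buffer.concat(parts));
      }
    }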

There are other fun problems too, like the fact that I have a Postgres database that's ~700 GB in size, which means you have to spend time considering your DB design and doing query optimization too. I apparently have big-data problems in my bedroom (my home servers are in my bedroom closet).

---

It's all on github, FWIW:

Manager: [https://github.com/fake-name/ReadableWebProxy](https://github.com/fake-name/ReadableWebProxy)

Agent and salt scheduler: [https://github.com/fake-name/AutoTriever](https://github.com/fake-name/AutoTriever)

~~~
monsoon22
How do you circumvent cloud provider IP blocks? For example, one site blocks
all requests from AWS EC2 servers.

~~~
fake-name
Mostly, none of the sites I'm scraping do that.

I'm not scraping high-value sites like that (I mostly target amateur original content); it's not really of interest to other businesses. As such, I tend to just run into things like normal Cloudflare-wrapped sites, and one place that tried to detect bots and return intentionally garbled data.

If I run into that sort of thing, I guess we'll see.

------
prashnts
A neat trick I sometimes use to "scrape" data from sites that use jQuery AJAX to load data is to plug a middleware into jQuery's XHR handling:

    $.ajaxSetup({
      dataFilter: function (data, type) {
        // `this` is the jQuery settings object for the request
        if (this.url === 'some url that you want to watch!') {
          // do anything with the response body here
          awesomeMethod(data)
        }
        return data
      }
    })

I remember last using it on an infinite-scroll page, with a periodic callback that scrolled the page down every 2 seconds; the `awesomeMethod` just initiated the download. Pasted it all into the dev-tools console, and the cheap "scraper" was ready!

~~~
pault
You can also build a chrome extension if you need to navigate to multiple
pages and use a long-running scraping process. I've done this several times
and it's really easy to get one up and running if you use an extension
boilerplate (30 minutes tops).

~~~
esac
Do you have something? I was going to write the very same extension (but distributed, so I could add it to my PC and my friends' PCs), but I never did.

~~~
pault
This is the boilerplate I used last time:
[http://extensionizr.com](http://extensionizr.com)

~~~
prashnts
Didn't know about `extensionizr`. Looks super cool. Thanks!

------
danso
This good list of tactics underscores, for me, how the state of the Web has
made it a lot more difficult to teach web scraping as a fun exercise for
newbie programmers. It used to be that you could get by with the assumption that what you see in the browser is what you get when you download the raw HTML... but that's less and less often the case. So now you have to teach
how to debug via the console and network panel, on top of basic HTTP concepts
(such as query parameters).

(Even more problematic is that college kids today seem to have a decaying
understanding of what a URL is, given how much web navigation we do through
the omnibar or apps, particularly on mobile, but that's another issue).

I've been archiving a few government sites to preserve them for web-scraping exercises [0] (the Texas death penalty site is a classic, for being both relatively simple at first and incredibly convoluted depending on what level of detail you want to scrape [1]). But I imagine even government sites will move more toward AJAX/app-like sites, if the trend at the federal level means anything.

That said, I think the analytics.usa.gov site is a great place to demonstrate
the difference between server-generated HTML and client-rendered HTML.

But as someone who just likes doing web scraping, I feel the tools have mostly kept up with the changes to the web. It's been relatively easy, for example, to run Selenium through Python to mimic user action [2]. Same with PhantomJS through Node, which has vastly improved how accurately it renders pages for screenshots compared to what I remember from a few years back.
[0] [https://github.com/wgetsnaps](https://github.com/wgetsnaps)

[1] [https://github.com/wgetsnaps/tdcj-state-tx-us--death_row](https://github.com/wgetsnaps/tdcj-state-tx-us--death_row)

[2]
[https://gist.github.com/dannguyen/8a6fa49253c1d6a0eb92](https://gist.github.com/dannguyen/8a6fa49253c1d6a0eb92)

~~~
niftich
It's unfortunate that nearly every webpage these days is a JavaScript state machine that you have to execute in a sandbox, inspecting its internal state to get anything out.

On a blog post by Paul Kinlan ('Open Web Advocate' at Google and Chromium)
[1], I lamented that we ended up here instead of the semantic web because the
semantic web was hard to execute. Instead, every web page is a black-box, only
navigable by an intelligent and/or sufficiently persuadable human.

But this is also why I don't buy ethical arguments against scraping. Sure, _legally_ any company can unilaterally set any ToS prohibition against behavior they don't want, and these terms may be tested in court. But navigating a page in an automated manner designed to resemble the interactions of humans (i.e. through Selenium) is in my opinion _ethical_, because it merely time-shifts a user's activity.

[1]
[https://news.ycombinator.com/item?id=12206846](https://news.ycombinator.com/item?id=12206846)

------
XCSme
Tbh I didn't enjoy the article; it just seems like someone who has just learned about Node.js tried to explain (and mostly failed) how to use some packages to scrape a page. I was expecting to learn some new techniques, but all it explained was how to make a few API calls in order to solve a very specific problem. There was also the overall arrogant tone: "I found their interview approach a bit of a turn off so I did not proceed to the next interview and ignored her emails" just shows a lot of immaturity.

~~~
nathancahill
Part of the turnoff for me was the middle-schooler tone and vocabulary. Good
walkthrough with good code examples though, obviously written by a very smart
JS dev.

~~~
franciskim
OK, I'll try to explain to this thread. I actually thought about removing the Facebook part, but I kept it in there because that is kind of how I felt, and it is real. The middle-schooler tone and vocab are probably because I don't read a lot of books, and English is my 2nd language.

In reply to XCSme - no, I am not new to Node, and the point of my post is to illustrate some techniques that I haven't seen published anywhere on HN or in the community. My focus is quite different from what you think it is, so maybe that is my fault for bad writing; I'm still new to writing and learning.

------
Jake232
Not wanting to thread-hijack, but I'm just going to post an article I wrote a few years back, as it covers a few other things that are still relevant and often still gets referenced. Maybe it'll help some people out in combination with OP's post.

[http://jakeaustwick.me/python-web-scraping-resource/](http://jakeaustwick.me/python-web-scraping-resource/)

~~~
mdaniel
I was surprised to not see Scrapy listed, but then I saw there were some
comments about it - but seriously, doing by hand what Scrapy has spent _years_
perfecting is highly suboptimal.

I guess the distinction is between whether one wants to just "toy around" or
run the spider for-real.

------
stupidcar
I wrote a fairly complex spidering and scraping script in Node a few months ago. I found downcache[1] to be absolutely invaluable, particularly as I was debugging my parsing scripts, since I was able to rerun them relatively quickly over the cached responses.

However, when the network was no longer a bottleneck, I found that the speed
and single-threaded nature of Node became one. It wasn't really that slow,
relatively speaking, but I had a few hundred gigs of HTML to chew through
every time I made a correction, so it was important to keep the turnaround as
fast as possible.

I eventually managed to manually partition the task so I could launch separate
Node scripts to handle different parts of it, but it wasn't a perfect split,
and there was a fair bit of duplicated work, where a shared cache would have
helped a great deal.

In retrospect, I should have thrown my JS away and started again in something with easy threading like Java or C#. But -- familiar story -- I'd underestimated the complexity of the task to begin with, and by the time I understood, I'd sunk a lot of time into writing my JS parsing code and didn't fancy converting it all to another language, particularly when it always seemed like "just one more" correction to the parsing would make everything work right. In the end, what was supposed to take a weekend took about three months of work, off and on, to finish.

[1]
[https://www.npmjs.com/package/downcache](https://www.npmjs.com/package/downcache)

~~~
ralusek
Threading in Node is very easy: just use the built-in cluster module. Alternatively, take any of the CPU-intensive activity, like parsing the HTML and formatting it as JSON, and put that on an AWS Lambda.

You can invoke as many Lambdas from your application as you want in parallel, and you're not going to be bottlenecked by your CPU :)
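
A minimal sketch of that fan-out using the built-in cluster module; the file list and messaging protocol are invented, and (as the reply below notes) these are separate processes rather than threads:

    const cluster = require('cluster');
    const os = require('os');

    if (cluster.isMaster) {
      const files = ['a.html', 'b.html', 'c.html']; // e.g. cached responses
      let next = 0;
      const dispatch = (worker) => {
        // hand out the next file, or retire the worker when none remain
        if (next < files.length) worker.send(files[next++]);
        else worker.kill();
      };
      os.cpus().forEach(() => {
        const worker = cluster.fork();
        worker.on('message', () => dispatch(worker)); // finished one file
        dispatch(worker);
      });
    } else {
      process.on('message', (file) => {
        // ...parse `file` and store the results here...
        process.send('done'); // ask the master for more work
      });
    }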

~~~
stupidcar
Clustering in Node creates isolated child processes, not threads. I needed to
have shared queues, in-memory caches, and hashes to coordinate workers and
avoid them doing duplicate work.

I did consider using clustering and having some master process coordinate everything, with some shared-memory caching library. But it would not be "easy" to set up, especially compared to something like Java, where you get thread pools and synchronized thread-safe collections out of the box.

And Lambda would have been totally impractical. As I said, I had hundreds of gigs of data to process. If I'd been uploading all that over my puny ADSL upstream every time, I'd still be waiting for a single run to complete.

I'm not trashing Node. I like it. There's a reason I used it in the first place, after all. But for this particular use case, I didn't find it was a very good fit.

~~~
LunaSea
Threading for a crawler is just a dirty way of not handling distribution. When you need more than one server, your threads won't save you. It has nothing to do with Node.js and thread support.

~~~
stupidcar
I wasn't creating a new search engine, I was doing a one-off scraping job in
my spare time. Creating a fully distributed solution would have been total
overkill. But threading could and would have helped.

Honestly, stupidly hostile and ignorant comments like this are the absolute
worst thing about Hacker News.

------
dchuk
Scraping with Selenium in Docker is pretty great, especially because you can use the Docker API itself to spin containers up and down at will. So you can spin up a container to hit a specific URL in a second, scrape whatever you're looking for, then kill the container. This can be driven from a job queue (Sidekiq if you're using Ruby) to do all sorts of fun stuff.
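
As a sketch of that pattern in Node, assuming the dockerode client (the Ruby/Sidekiq setup mentioned above would look analogous); the image, port binding, and scrape step are placeholders:

    const Docker = require('dockerode');
    const docker = new Docker(); // talks to /var/run/docker.sock by default

    async function scrapeInThrowawayContainer(url) {
      // spin up a one-off Selenium container...
      const container = await docker.createContainer({
        Image: 'selenium/standalone-chrome',
        HostConfig: { PortBindings: { '4444/tcp': [{ HostPort: '' }] } },
      });
      await container.start();
      try {
        // ...point a WebDriver client at the container's mapped 4444
        // port, load `url`, and extract whatever you're looking for...
      } finally {
        // ...then kill the container regardless of what happened.
        await container.stop();
        await container.remove();
      }
    }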

That aside, hitting Insta like this is playing with fire, because you're
really dealing with Facebook and their legal team.

~~~
spikej
Serious question: What do you gain from having an extra layer like docker?

~~~
franciskim
Selenium Grid runs in Docker, so it's easy to have multiple instances running. Better control.

~~~
ramblenode
What are the advantages of this versus a thread pool of web drivers? I'm not
really familiar with Selenium Grid.

~~~
paulryanrogers
Grid can dynamically dispatch based on the browser and capabilities you want
when you create the session.

------
mosburger
> AngelList even detects PhantomJS (have not seen other sites do this).

I run a site that aggregates/crawls job boards for remote job postings, and AngelList has been VERY difficult to crawl for various reasons, but you can easily get PhantomJS to work (I have). Having said that, I've never felt very good about the fact that I'm defeating their attempts to block me (even though I feel like I'm doing them a favor) and will likely retire that bot soon.

It kinda sucks that I'm just grabbing publicly-available content in a very
low-bandwidth way, but I really can't convince myself that what I'm doing is
very ethical.

My to-do list includes making my crawler into a more well-behaved bot and that
will have to go.

~~~
ixtli
I think you may want to decouple your ethical analysis from which private
company is making the most money. Remember that the only functional difference
between you and somewhere like kayak.com or padmapper is business
relationships.

------
pault
I don't know why more people don't use Chrome extensions for scraping. Using a boilerplate[1], you can get a scraper up and running in minutes. Start a Node server that serves up URLs and stores parsed data, and run the scraper in the browser. Best of all, you can watch it running and debug it if something goes wrong. I know it doesn't scale well if you're running a SaaS, but for personal projects and research/data normalization it's the lowest barrier to entry, in my opinion. (A rough sketch follows the link below.)

[1] [http://extensionizr.com](http://extensionizr.com)
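
A rough sketch of the content-script half of that pattern; the endpoints, selector, and response shape are hypothetical, and the extension's manifest needs permission to reach the local server:

    // Parse the current page and hand results to a local Node server,
    // which replies with the next URL to visit.
    const rows = [...document.querySelectorAll('.listing')].map((el) => ({
      title: el.querySelector('h2').textContent.trim(),
      href: el.querySelector('a').href,
    }));

    fetch('http://localhost:3000/results', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ url: location.href, rows }),
    })
      .then((res) => res.json())
      .then(({ next }) => {
        if (next) location.href = next; // script re-runs on the next page
      });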

------
franciskim
Sorry guys, hit by traffic - just scaling my EC2 at the moment.

~~~
niftich
No worries, we had your page scraped just in case ;)

Google Cache link:
[http://webcache.googleusercontent.com/search?q=cache:https:/...](http://webcache.googleusercontent.com/search?q=cache:https://franciskim.co/2016/08/24/dont-need-no-stinking-api-web-scraping-2016-beyond/)

Archive.is link: [http://archive.is/DQccs](http://archive.is/DQccs)

~~~
franciskim
haha :)

------
jgmmo
Good stuff.

I do a good bit of scraping, and I made RubyRetriever[1] to make my life easier, but it seems like I'm getting roadblocked on occasion, probably due to some of the things you mention in your article.

Is there any way for a site to verify that only their JS and CSS files are
linked? Like preventing injection?

[1]:
[https://github.com/joenorton/rubyretriever](https://github.com/joenorton/rubyretriever)

~~~
throwanem
You could inspect the src attributes of script tags, and the href attributes
of link tags with rel="stylesheet", for acceptable domains. I doubt it would
cover all cases, but it might be a start.
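
A minimal sketch of that check, run in the page; the whitelist is an example:

    // Flag scripts and stylesheets loaded from unexpected domains.
    const allowed = ['example.com', 'cdn.example.com']; // your own domains
    const suspicious = [
      ...document.querySelectorAll('script[src], link[rel="stylesheet"]'),
    ].filter((el) => {
      const url = new URL(el.src || el.href, location.href);
      return !allowed.includes(url.hostname);
    });
    if (suspicious.length) {
      console.warn('Unexpected external resources:', suspicious);
    }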

------
nreece
At Feedity ([https://feedity.com](https://feedity.com)), we "index" webpages
to generate custom feeds. Over the years, we've designed our system to use a
mix of technologies like .NET (C#) and node.js, and implemented a bunch of
tweaks and optimizations for seamless & scalable access to public content.

~~~
LunaSea
Any tips and tricks you are able to share about the technologies you guys
developed? It would be especially interesting to see what you use for text
extraction from HTML.

------
IANAD
> But if you are automating your exact actions that happen via a browser, can
> this be blocked?

Yes, by checking the times between actions and the number of actions in a time period, and blocking atypical activity. I was once IP-banned from a site for a few months after scraping it too aggressively and hitting links on the site that were hidden from humans.

The random wait settings specified in the post are better than nothing, but still too flimsy. You would need to put hours between requests, only request during a certain 15-hour window, take days off, and eventually you aren't scraping regularly enough to do much good.
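
As a sketch, "more human" pacing along those lines could look like this; every number here is arbitrary:

    // Return a delay (in ms) before the next request: occasionally take
    // a whole day off, never run at night, and otherwise wait 1-4 hours.
    function nextDelay(now = new Date()) {
      if (Math.random() < 0.1) return 24 * 3600 * 1000; // day off
      const hour = now.getHours();
      if (hour < 7 || hour > 22) return 8 * 3600 * 1000; // sleep at night
      return (1 + Math.random() * 3) * 3600 * 1000; // 1-4h between hits
    }

    setTimeout(function run() {
      // ...make one request here...
      setTimeout(run, nextDelay());
    }, nextDelay());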

Scraping is not an API, and I should know - I used to do it for a living. It's unreliable. It requires constant maintenance. APIs can break too, but they are meant for the sort of consumption you are attempting.

If you scrape for a living, only do it as a side job.

~~~
red_blobs
It really depends on the data you are scraping. My main business relies on scraping, and my data-mining application has been running for over 5 years. If you have enough IP addresses available to you, it becomes almost impossible to distinguish your traffic from normal users hitting the site... and bandwidth has gotten so cheap that the overhead is very affordable.

I've noticed that most sites actually don't change that often. I deal with
changes once or twice every 3 months.

"If you scrape for a living, only do it as a side job."

This is true if you are scraping the low hanging fruit. I scrape 40+ sources
(I do have access to a few APIs as well) and then have to extract the
patterns/data I need to then integrate it into my business model. This is all
automatic now and I only work on upgrading for speed and efficiency.

If you have to scan millions of urls daily from 1 site, it's probably not
going to work out. You need to figure out clever ways of getting the data and
using it without breaking any laws or pissing off the site owner.

------
headmelted
I actually love Selenium for this purpose, for much the same reasons the
author mentions here.

It's almost impossible for a website to reliably detect that a client web
browser is being automated, and I find I can make Selenium scripts much more
adaptable to breaking changes in websites when they occur than I can when
hooking up my code directly.

I actually disagree with the contention that Selenium is slower than directly
scraping though. The Firefox driver has always been lightning fast for me and
the bottleneck is almost always server requests that would have been necessary
either way.

------
lamby
Whilst they mean well, I find these articles fundamentally deceptive: the arduous parts of "real world" scraping simply aren't in the parsing and extraction of data from the target page, the typical focus of these "scrape the web with X" pieces.

The difficulties are invariably in "post-processing": working around incomplete data on the page, handling errors gracefully and retrying in some (but not all) situations, keeping on top of layout/URL/data changes to the target site, not hitting your target site too often, logging into the target site if necessary and rotating credentials and IP addresses, respecting robots.txt, the target site being utterly braindead, keeping users meaningfully informed of scraping progress if they are waiting on it, the target site adding and removing data resulting in a null-leaning database schema, sane parallelisation in the presence of prioritisation of important requests, difficulties in monitoring a scraping system due to its implicitly non-deterministic nature, and general problems associated with long-running background processes in web stacks.

Et cetera.

In other words, extracting the right text from the page is the easiest and most trivial part by far, with little practical difference between an admittedly cute jQuery-esque parsing library and even just a blunt regular expression.

It would be quixotic to simply retort that sites should provide "proper" APIs
but I would love to see more attempts at solutions that go beyond the
superficial.

~~~
dismantlethesun
> the arduous parts of "real world" scraping simply aren't in the parsing and
> extraction of data from the target page, the typical focus of these "scrape
> the web with X" articles.

I can agree with this, having written a scraper as part of core business functionality (we paid a company for access, but access was just to bare HTML blobs and CSVs, not an actual API).

However, to what degree you do all of this is negotiable, whereas the 'core' of screen scraping is not: all scrapers have to first figure out how to get the text, parse it, then stick it back into their system.

An example of what I mean when I say 'negotiable' is....

> working around incomplete data on the page

Deciding how to do this depends on your problem domain. Sometimes we'd get bad computed data from our source but not care, because it just meant putting more work into calculating it from a more raw source.

> not hitting your target site too often

If they publish how often you are allowed to scrape, this isn't too difficult. If not, then trial and error is the only solution. On occasion, a site simply doesn't know or care. For example, in my case the site was static content behind a CDN, so as long as we stayed anywhere under 200 req/second, no flags would ever be raised.

For most smaller sites that you are unofficially scraping, you may be limited to something like 1 request every 2 seconds.
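
A minimal sketch of that kind of polite pacing, plus retries with backoff; `fetchPage` stands in for whatever HTTP client you already use, and the interval and retry counts are examples:

    const INTERVAL = 2000; // ~1 request every 2 seconds
    const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

    async function politeFetch(fetchPage, url, retries = 3) {
      for (let attempt = 0; ; attempt++) {
        try {
          return await fetchPage(url);
        } catch (err) {
          if (attempt >= retries) throw err; // give up; log upstream
          await sleep(INTERVAL * 2 ** attempt); // exponential backoff
        }
      }
    }

    async function crawl(fetchPage, urls) {
      const results = [];
      for (const url of urls) {
        results.push(await politeFetch(fetchPage, url));
        await sleep(INTERVAL);
      }
      return results;
    }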

------
Twisell
What bothers me the most: recently I wanted to extract an archive of all the threads I participated in on an Internet forum. The webmaster told me that the BBS he uses doesn't provide such a function and that I'd just have to download each thread manually... (300+ threads in my case).

He then said that it doesn't bother him if I scrape these threads. So I'm currently figuring out how to handle his site's cookie-protected search feature, so that my painstaking effort (I'm not a dev, more a DB guy) can be reproduced more easily by other users of the service.

But this shouldn't happen in the first place, because all posts on this service are stored in a cleanly organized MySQL DB. Yet since no method is provided, the only way to get back structured data is by scraping (the webmaster told me that no, he won't run custom SQL, because he "doesn't want to mess with his DB").

So even though all the data is publicly available through the forum, only a geek can download a personal archive... or Google, because Google scrapes and stores everything.

~~~
CalRobert
It's overkill for most things, but I have found that on occasion the best way to scrape stuff behind annoying frontends is with Selenium. pysaunter is a useful library that sits one layer of abstraction higher, if you're familiar with Python.

~~~
CalRobert
Well I see now that I'm really late to the party with that comment.

------
KennyCason
As someone who does a lot of scraping, I was happy to learn about Antigate :)

~~~
KennyCason
Just joking as I don't scrape unless scraping is allowed. :)

------
kingkool68
It's trivial to scrape public Instagram URLs...

[https://github.com/kingkool68/zadieheimlich/blob/master/func...](https://github.com/kingkool68/zadieheimlich/blob/master/functions/instagram.php#L421-L428)

------
etatoby
Does anybody know what the author means by "lead" (noun)?

I don't think it's any of the regular meanings:
[http://www.ldoceonline.com/search/?q=Lead](http://www.ldoceonline.com/search/?q=Lead)

But it doesn't seem to be any of these slang terms either:
[http://www.urbandictionary.com/define.php?term=lead](http://www.urbandictionary.com/define.php?term=lead)

~~~
stedaniels
[https://en.wikipedia.org/wiki/Lead_generation](https://en.wikipedia.org/wiki/Lead_generation)

------
writeslowly
Have you run into any issues from running all of your scrapers off of AWS, or
just from sites detecting that you're accessing large numbers of pages in some
sort of obvious pattern? I guess I was hoping there would be sites with more
interesting ways to screw with web scrapers (rearranging certain page elements
or something) than just throwing up a CAPTCHA.

~~~
madamelic
Most really don't. A lot of big sites don't seem to care, at least in my
experience.

The few that I've seen just 'ban' your IP for a few minutes. If you hit Wikipedia too hard too quickly, they will essentially refuse to serve you for a while. It was a number of years ago that I was doing it, but basically you would be scraping and then just stop getting data back (maybe I wasn't reading response codes and could've realized sooner what was happening).

~~~
detaro
Wikipedia provides you with an API and guidelines on how to use it, so you
really shouldn't be scraping it directly or so much you hit enforced limits.

~~~
user5994461
Wikipedia provides archives of all its content.

No need to scrape it when you can readily download a nicely formatted .xml.zip file containing all knowledge written by mankind.

------
zzzcpan
> But if you are automating your exact actions that happen via a browser, can
> this be blocked?

Of course it can! You won't be able to defeat even the simplest statistical anti-scraping measures. Even something as simple as keeping individual rate limits for the /16 subnets of actual visiting users will get you in trouble.
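
A fixed-threshold sketch of that idea (a real deployment would presumably learn per-subnet baselines; the window and limit here are invented):

    // Count requests per /16 block in a sliding window and reject
    // anything over the threshold.
    const WINDOW = 60 * 1000; // 1 minute
    const LIMIT = 300; // max requests per /16 per window (example)
    const hits = new Map();

    function allow(ip, now = Date.now()) {
      const subnet = ip.split('.').slice(0, 2).join('.'); // first 16 bits
      const recent = (hits.get(subnet) || []).filter((t) => now - t < WINDOW);
      recent.push(now);
      hits.set(subnet, recent);
      return recent.length <= LIMIT;
    }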

------
kevindeasis
Does cheerio account for single-page apps? In any case, thanks for the tutorial!

Anyway, I added your post here along with other data-mining resources:

[https://github.com/kevindeasis/awesome-fullstack#web-scraping](https://github.com/kevindeasis/awesome-fullstack#web-scraping)

------
elchief
To fight scrapers, we show some values as images that look like text (but not all the time).

And we insert random (non-visible) HTML and CSS classes into our site to screw with 'em, and use randomized CSS classnames. This fucks with XPaths and CSS selectors.

You can't stop them, but you can make their lives painful.
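
A sketch of the randomized-classname half of that, assuming server-side rendering in Node; the names are examples:

    // Map real class names to random ones at render time, so selectors
    // hard-coded against the emitted markup go stale on every deploy.
    const crypto = require('crypto');
    const classMap = new Map();

    function cls(name) {
      if (!classMap.has(name)) {
        classMap.set(name, 'c' + crypto.randomBytes(4).toString('hex'));
      }
      return classMap.get(name);
    }

    // In a template: <td class="${cls('price')}">...</td> renders as
    // something like <td class="c9f2a01bc">...</td>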

~~~
elmigranto
> To fight scrapers, we show some values as images that look like text

You are fighting screen readers more than anything; as well as legitimate
plugins, form autofills, etc. If this is for captcha, you are fighting all the
users as well.

> And we insert random (non-visible) html and css classes in our site to screw
> with em, and use randomized css classnames.

Legitimate browser plugins, etc. I'd just use electron or selenium with `nth-
child`, `:visible`, `[class*="…"]`, etc.
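
For example, selecting by document structure instead of class names (the selectors below are hypothetical):

    // "the second cell of each row in the first table" -- immune to
    // classname randomization as long as the layout stays stable
    const cells = document.querySelectorAll(
      'table:first-of-type tr > td:nth-child(2)'
    );

    // or match any element whose class merely *contains* a stable fragment
    const prices = document.querySelectorAll('[class*="price"]');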

What you're effectively doing is wasting time on useless stuff. This is even more useless than trying to prevent the copying of DVDs or the pirating of games.

~~~
BrandonMarc
> What you effectively doing is wasting time on useless stuff. This is even
> more useless than trying to prevent copying of DVDs or pirating games.

Can you be so sure? The Union blockade of the Confederacy had plenty of holes,
and smugglers / privateers / blockade-runners made good money getting through
(when they survived) ... but that doesn't mean the blockade wasn't effective
all the same at weakening the Confederate military and economy.

~~~
elmigranto
Do you really not see the difference between military blockade and randomizing
CSS classes?

~~~
BrandonMarc
Honestly, no. This one time I was pissed off at Egypt for undercutting me in
cotton prices, so I tried to set up a blockade to prevent merchant ships going
in and out of Cairo.

... and it would have worked, too, except my naval vessels were all CSS classes. I even tried to name them cleverly, a la "USS hero unit" or "USS datatable table-condensed span9", but my plan was foiled.

------
skeletonjelly
Hooray Melbourne! Would be interested in seeing this at a meetup group if you were thinking of presenting.

~~~
danieltrembath
Another 3000'nder. Would be great to see this turned into a talk somewhere.

~~~
skeletonjelly
For sure. Trying to think which ones. Probably the MelbJS one and maybe
dddmelb? You could modify it to talk at the OWASP one perhaps. Which ones have
you been to?

------
frostymarvelous
While everyone is busy debating whether scraping is bad or whether it's legal, I just can't stop thinking about Antigate.

Oh, the sweatshops that must have been set up to deliver this service. That, to me, is the true horror of this story.

------
slig
I wonder how effective the CloudFlare anti-scraper protection is against this approach of breaking CAPTCHAs.

Also, I find it interesting that big websites don't just block all traffic from AWS IPs as they do with Tor.

~~~
user5994461
There can be legitimate traffic coming from AWS, if not the site itself.

It's especially true when the site provides an API and is meant to be
integrated by people/companies. In which case, the AWS traffic is likely to
include major and/or important and/or paying customers. You really don't want
to block that.

On the other hand, Tor traffic is likely to be 90% evil. When in doubt, just block it. (That makes me think, I should run some proper stats and maybe publish a blog post about that.)

~~~
slig
> There can be legitimate traffic coming from AWS, if not the site itself.

The traffic from the site itself, if it's hosted there, would come from the
intranet IP address, right? Not the public facing one.

> It's especially true when the site provides an API and is meant to be
> integrated by people/companies. In which case, the AWS traffic is likely to
> include major and/or important and/or paying customers. You really don't
> want to block that.

Agreed, but it's fairly easy to block the AWS IP traffic on web endpoints and
not on the API endpoints.

------
unixhero
And from the trenches:

- Rails application

- scraping with the Nokogiri gem in Ruby

- simple models doing the scraping in the Rails app

- some scraping parsed with CSS selectors (Nokogiri)

- some scraping parsed with regex (Nokogiri)

- persisting to DB, text, even Google Docs

- presentation on web, text, PDF, XLS

Boom

------
ge95
How do you push a button, like hitting 'next' on a paginated page?

~~~
oli5679
Right-click the 'next' button in Chrome and use 'inspect element' to find its id/class/CSS selector, and then click it from your driver, e.g. with selenium-webdriver:

    browser.findElement(By.css('#Next')).click();
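
Extending that into a loop over every page might look like this sketch, using selenium-webdriver; the `#Next` selector and waits are illustrative:

    const { Builder, By, until } = require('selenium-webdriver');

    async function scrapeAllPages(startUrl, extract) {
      const driver = await new Builder().forBrowser('chrome').build();
      try {
        await driver.get(startUrl);
        for (;;) {
          await extract(driver); // scrape the current page
          const next = await driver.findElements(By.css('#Next'));
          if (!next.length) break; // no next button: last page
          await next[0].click();
          await driver.wait(until.stalenessOf(next[0]), 10000);
        }
      } finally {
        await driver.quit();
      }
    }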

------
rch
There is so much that's missing from this. What about gathering tokens from
customers vs. paying for social data feeds? How about canned services like
80legs?

~~~
franciskim
Hmm, yeah, there are a lot of other things I could write about. 80legs seems like another Scrapy-type SaaS? Not sure what you mean about gathering tokens from customers.

~~~
rch
I've heard of companies that scrape on behalf of customers who will walk
marketing people through the process of creating an API token to help mitigate
rate limiting.

------
ben_jones
Currently getting a 502 Bad Gateway. Guessing this post is also trending on Reddit and we hugged it to death :(.

~~~
franciskim
I'm on Reddit?

~~~
soared
Google Analytics > Acquisition > Source/Medium > type "reddit" in search bar.
Add secondary dimension "referral path"

------
rezashirazian
When I was building liisted.com I scraped using Selenium and it worked great.

