
How we learnt to stop worrying and love web scraping - danso
https://www.nature.com/articles/d41586-020-02558-0
======
b6z
I admit to being a bit disappointed that a well-known disadvantage of web
scraping was not mentioned: web scraping is fragile!

Websites change, web frameworks evolve, and just a subtle reordering of
some <div>s or renaming of CSS classes means your perfect scraping code from
yesterday will break tomorrow -- maybe not leaving you empty-handed, but
probably missing some data or delivering the wrong data.

If there is an API you can use, use it. If your budget allows you to pay for
API access, buy it. APIs tend to be more stable than scraping, and the data
provider will probably inform you of changes. Contacting them might even
get you more interesting data, as not every column in their database
necessarily gets published on the website.

~~~
hansvm
I've found the opposite to be true -- when an entity is maintaining an API and
their website with the same data, the website is their core business. The API
is prone to being incomplete, buggy, subject to sudden deprecation,
unreasonably rate limited (crippling access to some objects below what a
casual human user has), and so on.

Conversely, overall document structure doesn't change much over time. I know
it _can_; there's a social contract that APIs should change slowly while
documents can change whenever, but that isn't what I observe in the wild. Even
on fairly major redesigns, the overall structure has minimal edits.

A technique I've used before (wasted effort in hindsight, since web pages are
stable and I never have to update my scrapers) is to come up with several
semantically different ways of accessing a piece of data on a page. It serves
two purposes: you can recover from small page changes by having the different
methods vote, and you can detect most kinds of page changes by noticing
discrepancies, flagging that the scraper needs to be updated soon.
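
Roughly, the voting idea looks like this (the selectors here are invented for
illustration; the point is that each extractor finds the same value in a
semantically different way):

    from collections import Counter
    from bs4 import BeautifulSoup

    def price_by_class(soup):
        el = soup.select_one('span.price')
        return el.get_text(strip=True) if el else None

    def price_by_label(soup):
        label = soup.find(string='Price:')
        nxt = label.find_next('span') if label else None
        return nxt.get_text(strip=True) if nxt else None

    def price_by_itemprop(soup):
        el = soup.find(attrs={'itemprop': 'price'})
        return el.get_text(strip=True) if el else None

    def extract_price(html):
        soup = BeautifulSoup(html, 'html.parser')
        votes = [f(soup) for f in (price_by_class, price_by_label, price_by_itemprop)]
        counted = Counter(v for v in votes if v is not None)
        if not counted:
            return None
        value, count = counted.most_common(1)[0]
        if count < len(votes):
            # Disagreement suggests the page changed; flag for maintenance.
            print('warning: extractors disagree:', votes)
        return value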

~~~
throwaway894345
> the website is their core business

Granted, but there are lots and lots of ways they can break scrapers _in the
pursuit of_ their core business, such as a website redesign. For example,
moving from static HTML to a web framework would require your scraper to
actually run the JavaScript to generate the DOM in the state that a reader
might view it in, and this is quite a lot more complicated than walking the
static HTML.

~~~
waprin
It's not too complicated, you just need a headless browser. Having done a ton
of web scraping projects, I'd recommend just starting with this approach, as
even sites that look pretty static use JavaScript in subtle ways.
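
Something like this is enough to get started (a minimal sketch with Selenium,
assuming chromedriver is on your PATH; the URL is a placeholder):

    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    driver = webdriver.Chrome(options=options)
    try:
        driver.get('https://example.com')
        html = driver.page_source  # the DOM after JavaScript has run
    finally:
        driver.quit()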

~~~
hermanradtke
Using a headless browser for scraping is a lot slower and more resource
intensive than parsing HTML.

~~~
sullyj3
Sure, but it might be the only way to get the data.

~~~
hansvm
It might be, but _starting_ a scraping project with a headless browser might
be excessively expensive if you don't need the additional features.

------
ricardo81
It all makes sense if you believe in the idea that information is universal
and should all be freely available.

Though you get cases like celebritynetworth.com (no affiliation with me),
featured on HN, where Google wanted to APIize the data, and when that was not
available they decided to scrape it. It effectively killed the business.

I think if a company decides to offer an API or other structured data then
fine, they've decided that their data can be shared in a way that makes it OK
to collect en masse.

I wouldn't say we should "learn to love" scraping when there are sites that
put a lot of man-hours into the data that's seen on public HTML pages, only
for someone else to spend a few hours scraping it and repurposing it.

Now, if the major traffic drivers like search engines and social networks were
able to differentiate who the true authoritative source is, that'd be great.
But they don't.

~~~
lazyjeff
The article avoids directly addressing this, but I think it is the real
elephant in the room -- whether you have the right to web scrape for research.

The article says, "You might be able to use what you scrape, but it’s worth
checking that you can also legally share it. Ideally, the website content
licence [sic] will be readily available."

In practice, the site whose data you want to scrape either 1) gives no
license/information about whether they are okay with you scraping it (and if
they are okay with it, they usually offer the source data anyway), or, more
commonly, 2) strictly prohibits it in the terms of service, in which case
it's not clear whether a scraper which mimics a browser falls under fair use.

It's an area where when you ask your university attorney, they exchange a few
emails with you and then avoid making any decision. I think it's just because
it's not well tested in court (at least when I encountered these situations),
and depends on whether you'd be subject to a takedown notice or lawsuit.

~~~
elwell
> it's not clear whether a scraper which mimics a browser falls under fair use

Or what about a 'human scraper' a la Amazon MTurk?

~~~
hansvm
It's rarely the scraper that runs afoul of legal complications (as always,
with exceptions); it's what you do with the scraped data.

------
gabrielsroka
The article mentions Python, Requests and BeautifulSoup.

Here's a short example that scrapes HN Favorites. [0]

    #!/usr/bin/env python3

    import requests
    from bs4 import BeautifulSoup

    username = input('username: ')

    # session uses connection pooling, often resulting in faster execution.
    session = requests.Session()

    base = 'https://news.ycombinator.com/'
    path = f'favorites?id={username}'
    while path:
        r = session.get(base + path)
        s = BeautifulSoup(r.text, 'html.parser')
        for a in s.select('a.storylink'):
            print(a.text, a['href'])
        more = s.select_one('a.morelink')
        path = more['href'] if more else None

[0] Python:
[https://github.com/gabrielsroka/gabrielsroka.github.io/blob/...](https://github.com/gabrielsroka/gabrielsroka.github.io/blob/master/getHNFavorites.py)

[1] JavaScript, runs in browser with a UI:
[https://github.com/gabrielsroka/gabrielsroka.github.io/blob/...](https://github.com/gabrielsroka/gabrielsroka.github.io/blob/master/getHNFavorites.js)

~~~
pj_mukh
People love BeautifulSoup, but I've had better luck fully emulating a browser
with Selenium. While more complicated, it seems to run into fewer issues. Is
there a rule of thumb as to when to use one over the other?

~~~
derivagral
Nowadays you could just use headless Chrome[1] or Firefox[2], though it looks
like Firefox is lagging behind a bit and still wants to require Selenium.

Personally, I loved using BS for hobby projects until the SPA era started,
and then I had to either go headless (Selenium back then was great) and/or
monitor the network tab for their API.

[1]
[https://developers.google.com/web/updates/2017/04/headless-c...](https://developers.google.com/web/updates/2017/04/headless-
chrome)

[2] [https://developer.mozilla.org/en-
US/docs/Mozilla/Firefox/Hea...](https://developer.mozilla.org/en-
US/docs/Mozilla/Firefox/Headless_mode)
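
For reference, the headless Chrome route in [1] boils down to a one-liner that
dumps the JavaScript-rendered DOM (redirect the output wherever you like):

    chrome --headless --disable-gpu --dump-dom https://example.com > page.html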

------
welanes
The beautiful thing about web scraping is that it automates _so much_ tedious
work that the benefit in saved time offsets the brittleness inherent in
scraping data.

And with Puppeteer (also Playwright) it's never been easier: reCAPTCHA
solving, ad blocking, etc. in just a few lines of code[1].

I've built a business on the back of Puppeteer -
[https://simplescraper.io](https://simplescraper.io). Ten months in and we've
just passed 100 customers, so there are mucho opportunities in solving these
kinds of problems.

[1] [https://github.com/berstend/puppeteer-
extra/tree/master/pack...](https://github.com/berstend/puppeteer-
extra/tree/master/packages/puppeteer-extra)

~~~
davidbarker
This looks useful! Thanks for sharing. It reminds me of KimonoLabs, which I
loved before it was acquired by Palantir and shut down.

------
mherrmann
Shameless plug for my open source library Helium [1]. It extends Selenium's
API and makes it easier to use. The name comes from Helium also being a
chemical element, but lighter than Selenium.

1: [https://github.com/mherrmann/selenium-python-
helium](https://github.com/mherrmann/selenium-python-helium)

~~~
dotancohen
I've never heard of Helium, but this code snippet from the docs sold me:

    name = Text(to_right_of='Name:', below=Image(alt='Profile picture')).value

Thank you!

------
DoingIsLearning
> We strongly feel that more researchers should be developing code to help
> conduct their research, and then sharing it with the community. If manual
> data collection has been an issue for your project, a web scraper could be
> the perfect solution and a great beginner coding project.

I totally agree with the general idea of sharing code and tools in an open
research community, but incentivising someone from a field completely
unrelated to CS to 'just code' sounds like pretty bad advice. This is very
likely to lead to a lot of time spent, frustration, and underwhelming results.

Why not incentivise building a professional software team at an inter-
departmental university level and have that team act as a multi-department
shared resource?

~~~
Jeriko
There is a ton of programming in research, and almost all academic code is
written by the researchers, whether they are in biology, physics, humanities,
whatever. This has some of the drawbacks you mention, but the solution is not
so obvious.

First, almost all academic code is really simple from a software engineering
perspective, but really complex from a subject matter perspective. Having a
deep understanding of both the data and the relevant hypotheses is critical,
and is often actually helped by writing the code yourself. Trying to
communicate every feature requirement perfectly to a third-party CS person,
and making sure every assumption is met, might be possible but is definitely
non-trivial.

Second, most of these projects are one-time use (code up some project over a
few months, write a paper, never touch again), and so spending a ton of time +
money making it robust and efficient is not really worth it. For things like
open source tools that are expected to be used by a lot of people it's much
more feasible to get engineers involved. The Chan Zuckerberg initiative is
actually funding a program that essentially does this [1].

[1][https://chanzuckerberg.com/rfa/essential-open-source-
softwar...](https://chanzuckerberg.com/rfa/essential-open-source-software-for-
science/)

~~~
tgvaughan
It's kinda funny that this isn't widely appreciated. It used to be that most
programs were research programs. Use of computers for other things was a kind
of spin-off technology. (Now it's the other way around, of course.)

------
whatever1
I am a proponent of web scraping. As long as the overall volume does not
overload the available resources, I don't see why automated retrieval of
publicly available info is an issue. If the data on my website could help
someone else's research, a simple reference to the source would make me super
happy.

~~~
Aperocky
I like APIs better, but people gate their API behind paywalls.

------
umaar
I'm working on a video course for all things browser automation, whether it's
testing, scraping, auditing, deploying, etc.

I'm keeping the codebase open on GitHub: [https://github.com/umaar/learn-
browser-testing/](https://github.com/umaar/learn-browser-testing/) so anyone
who wants to follow along can do so for free.

In the 2-scraping folder, there's a bunch of scraping examples such as
bypassing captchas, building an Amazon price checker, blocking CSS and
JavaScript resources during scraping to make the process much quicker, and a
few others. Hope it's useful.
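
The resource-blocking trick looks roughly like this with Playwright's Python
API (my own sketch, not code from the course):

    from playwright.sync_api import sync_playwright

    def handle(route):
        # Drop CSS, script, image, and font requests to speed up scraping.
        if route.request.resource_type in ('stylesheet', 'script', 'image', 'font'):
            route.abort()
        else:
            route.continue_()

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.route('**/*', handle)
        page.goto('https://example.com')
        print(page.title())
        browser.close()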

------
mmerlin
I've written scrapers and web automation bots going as far back as 1998 (using
VB5 and VB6 automating IE).

Nowadays I mostly write custom scrapers using PHP (Guzzle, Curl, etc.) and
Python (I was introduced to Beautiful Soup by the "Python for Secret Agents"
book).

I've tried several commercial scraper tools and services over the years but
few stuck for various reasons.

UbotStudio had great potential but in the end was buggy and painful, almost
abandonware.

Scrapinghub.com is decent enough but a bit expensive for my projects.

80legs.com is cool for massive scale but was overly robots.txt restrictive at
the time I tried using it (for what I was scraping) and I don't like the
syntax.

A scraper colleague likes using Winautomation, however it's no longer for
sale separately because Microsoft acquired the company and rolled it into
their Power Automate SaaS (RPA/RDA focused).

There is a new tool called RTILA that I used for the very first time nine
days ago, which is actually the easiest way to create and run scrapers I have
found in 20 years.

The RTILA software currently has minimal documentation (apparently it's being
worked on now), however new features are being developed fast; all releases
are here on GitHub (see the frequency of releases):
[https://github.com/IKAJIAN/rtila-
releases/releases](https://github.com/IKAJIAN/rtila-releases/releases)

Another user of the software has produced several video tutorials here showing
how it works:
[https://www.youtube.com/channel/UCH6ov8LnB8-4ZF0yraxjw8Q/vid...](https://www.youtube.com/channel/UCH6ov8LnB8-4ZF0yraxjw8Q/videos)

You download it from GitHub but also still need to buy a license key from here
[https://codecanyon.net/user/ikajian](https://codecanyon.net/user/ikajian)

The RTILA home page is here
[https://rtila.ikajian.com/](https://rtila.ikajian.com/)

I am a genuine customer and have no connection in any way with the solo
founder developing it (except for a few emails and support forum messages).
A genuine recommendation for writing a bot more easily.

~~~
Aperocky
Having written a few scrapers, including the browser kind, I found the trick
was to specialize.

Each website responds differently: some need just requests, others need
Selenium with a pre-existing profile. Developing a common utility is almost
futile.

I do have useful building blocks, but for each individual thing I want to
scrape I scale out using project-specific code. It's never too slow either -
the time it would take to fill in all of the required bits in a do-it-all
tool would have been similar.
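
For the pre-existing-profile trick, the building block is roughly this (the
profile path is a placeholder for wherever your browser keeps its data):

    from selenium import webdriver

    options = webdriver.ChromeOptions()
    # Reuse an existing Chrome profile so cookies and logins carry over.
    options.add_argument('--user-data-dir=/home/me/.config/google-chrome')
    driver = webdriver.Chrome(options=options)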

------
andrewnc
During my studies, the web scraping projects were among my favorites. I adore
the challenge of building a robust scraper.

I recently sold a 'business' that does web scraping. Does anyone have insight
into which industries need more web scraping 'experts'?

~~~
ttttodayjunior
Hedge funds - check out the field of "alternative data"

~~~
andrewnc
oh, THAT is fascinating. Thanks for taking the time to respond. That
definitely feels like just what I'm looking for.

------
extremeMath
I failed a web scraping project due to strong anti-bot detection.

They checked for bots via user agent/screen size, maybe mouse movements,
trends in searches (same area code), etc. (Can they really detect me through
my internet connection headers, despite proxies?)

It was impossible for me to scrape, they won.

~~~
billconan
same here.

there are 2 approaches they use that make developing bots very difficult.

1. they detect device input. If there is no mouse movement while the website
is being loaded, they will consider it a bot.

2. they detect the order of page visiting. A human visitor will not enumerate
all paths; instead, they follow certain patterns. This is detectable with
their machine learning model.

I really don't have a solution for #2

~~~
forgotmypw17
I think the solution is "hybrid" scraping with a human driving the clicks and
the scraper passively collecting the data.

If you record, you can probably teach AI to emulate.
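
A minimal sketch of the idea with Selenium (the polling loop and file naming
are just one way to do it):

    import time
    from selenium import webdriver

    driver = webdriver.Chrome()  # visible window; the human does the clicking
    seen = set()
    try:
        while True:
            url = driver.current_url
            if url not in seen:  # snapshot each page the human lands on
                seen.add(url)
                with open(f'capture-{len(seen):04d}.html', 'w') as f:
                    f.write(driver.page_source)
            time.sleep(1)
    finally:
        driver.quit()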

~~~
extremeMath
I love this. I might try it. It doesn't scale, but that's okay for my project.

------
Dowwie
This seems like the tip of the iceberg for their work. The authors automated
document collection but are left with the task of searching through the docs
and crawling onward to linked documents. What kinds of libraries are
available today for crawling PDF documents?

------
dannyfraser
My favourite kind of web scraping is when you have to pick apart some
undocumented API that serves up data to an SPA, then figure out how to tidy up
the response(s) into a single pandas dataframe. Always a satisfying feeling to
solve one of those little puzzles.
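
The pattern usually boils down to something like this (the endpoint and JSON
shape are made up; you find the real ones in the browser's network tab):

    import requests
    import pandas as pd

    # The SPA's own data endpoint, discovered via devtools.
    r = requests.get('https://example.com/api/v2/listings?page=1')
    records = r.json()['results']  # the envelope key depends on the API
    df = pd.json_normalize(records)  # flatten nested JSON into columns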

------
dastx
Surprised no one has mentioned weboob. If your interest isn't specifically
along the lines of "go to all pages, and get all of the data" and more along
the lines of "go to my bank, log in using username and password, get the
transactions", weboob is ideal. It seems to be extremely well thought out
and easily extendable. It has plugins for a whole bunch of websites already,
including many, many banks and some of the biggest websites such as Reddit,
Google, and many more. The only downside I've had so far is that it will fall
flat on its arse if the website has to do javascripty things to load (it can
handle APIs and HTML easily though).

------
tracker1
I really like the options of Node + cheerio, or Node + Puppeteer, for these
kinds of jobs. Language of the browser, easy enough selection/usage: cheerio
if it's plain HTML, Puppeteer if it's dynamic/SPA content.

In the end, not hammering a site is key... also, on your own end, hashing the
main content so you aren't creating duplicate entries on your own side is
important.
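
A small fingerprint helper along those lines, in Python for consistency with
the article's examples (the normalization is just one reasonable choice):

    import hashlib

    def content_fingerprint(text: str) -> str:
        # Collapse whitespace and case so trivial markup churn doesn't
        # change the hash, then digest the normalized text.
        normalized = ' '.join(text.split()).lower()
        return hashlib.sha256(normalized.encode('utf-8')).hexdigest()

Store the fingerprint alongside each record and skip inserts whose fingerprint
you've already seen.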

As to fragility, that happens... in general, you need to update to match site
updates, but most sites won't be dramatically updated more than a couple times
a year if they aren't in active development.

------
wombatmobile
Corey Schafer is an articulate teacher of the craft of web scraping.

Python Tutorial: Web Scraping with BeautifulSoup and Requests

[https://youtu.be/ng2o98k983k](https://youtu.be/ng2o98k983k)

------
compscistd
One thing I tend to see a lot (90% of requests) is from people who are stuck
with mediocre CRMs and want better analytics and email reporting than they're
given. The only way to accomplish anything is to spoof a login and scrape,
because the only API is through "partners" who either pay a premium or are
giant.

I turn these requests down because it’s a ticking time bomb when you add in
the element of login (password resets, 2FA, another point of change). On the
other hand, I wonder how different this is from what Plaid does...

------
spicyramen
I used to do web scraping, and it took very long, as most websites change
formatting and tags quite often. This required engineering effort just to
maintain a few sites all the time.

------
uhtred
Seems like a lot of interesting sites recognise that I am scraping and block
access with a captcha. I've no idea how to get around that.

~~~
gmac
How are you doing it? Automating an actual browser using the developer tools
is probably the most under-the-radar way, and also quite nice to work with.

See: [https://github.com/jawj/web-scraping-for-
researchers](https://github.com/jawj/web-scraping-for-researchers)

~~~
ivoecpereira
Watch out for "devtools-detect" :) [https://github.com/sindresorhus/devtools-
detect](https://github.com/sindresorhus/devtools-detect)

~~~
gmac
This is a good point, and I've noticed this phenomenon on some video sites
lately.

Although the caveats for this particular library[1] imply enough false
positives _and_ false negatives that it seems mostly useless. Sites that take
this seriously must be doing something smarter.

[1] "Doesn't work if DevTools is undocked and will show false positive if you
toggle any kind of sidebar."

------
intrasight
I've not done it recently, but have in the past used Power BI for web
scraping. It's a pretty slick solution. A quick Google found this article.

[https://spencerbaucke.com/2020/04/29/web-scraping-in-
power-b...](https://spencerbaucke.com/2020/04/29/web-scraping-in-power-bi/)

------
superdeeda
I’d be curious to know why they “worried” in the first place though.

~~~
mrzool
Genuinely thought this was an editorial by nature.com explaining how they
stopped worrying about researchers scraping their website for data and instead
embraced them, and the rationale behind this decision. It turned out to be
just a 101 about web scraping in general. Disappointed.

------
junilop
Found this lib in the comments, tried it, and it looks nice. Can be good for
finding the XPath automatically and saving the manual traversing overhead.
[https://github.com/alirezamika/autoscraper](https://github.com/alirezamika/autoscraper)

------
sys_64738
Won't switching to WASM do away with the ability to web scrape?

~~~
dotancohen
Much scraping is done with automated control of a (possibly headless) web
browser. It runs all the same scripts and has all the same features as a
standard web browser for sighted users.

------
dtjones
Learnt -> learned

~~~
edmundsauto
Learnt is the preferred spelling in non-US English-speaking countries.
[https://www.grammarly.com/blog/learned-
learnt/](https://www.grammarly.com/blog/learned-learnt/)

