

The Least Effective Method for Blocking Web Scraping of a Website - minimaxir
http://minimaxir.com/2014/09/buzzscrape/

======
Uehreka
I disagree that it was grayed out to prevent scraping. I don't have a better
suggestion, but professional intuition tells me the real reason was something
really mundane and dumb that couldn't be figured out by someone unfamiliar
with the codebase.

In the end though, I really liked that chart. How many weeks of articles did
you scrape to get that data?

~~~
minimaxir
In that run, I scraped about 10k articles total, mostly from the books and
celebrities categories (so 30% of them were listicles, believe it or not).

I would have scraped more, except I got hit with Facebook's rate limit (600
requests / 600 seconds) even though I took explicit steps (1.1s delay between
requests) to avoid said rate limit.

Although in this case, more data might not improve the accuracy of the chart.
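For what it's worth, a fixed 1.1s sleep can still trip a 600-requests-per-600-seconds limit if the server counts requests over a sliding window rather than checking per-request spacing. A minimal sketch of window-based throttling (the sliding-window semantics here are an assumption, not Facebook's documented accounting):

```python
class Throttle:
    """Keep at most `limit` requests inside any `window`-second span.

    Tracks actual request timestamps instead of relying on a fixed
    inter-request delay, so bursts can't silently exceed the limit.
    """

    def __init__(self, limit=600, window=600.0):
        self.limit = limit
        self.window = window
        self.sent = []  # timestamps of requests still inside the window

    def wait_time(self, now):
        # Drop timestamps that have aged out of the window.
        self.sent = [t for t in self.sent if now - t < self.window]
        if len(self.sent) < self.limit:
            return 0.0
        # Otherwise wait until the oldest request leaves the window.
        return self.sent[0] + self.window - now

    def record(self, now):
        self.sent.append(now)
```

Before each request you'd call `wait_time(time.time())`, sleep that long, then `record()` the send time.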

------
daveloyall
The author tried harvesting the next url from the `Older` button _before_ he
tried

    urlFragment = "p=" + i++;

...wut? :)

~~~
minimaxir
That's how the Kimono Labs scraper works.

The code I use in my actual final parser is:

    bf_url = "http://www.buzzfeed.com/%s?p=%s&z=%s&r=1" % (category, current_page, access_token)
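For illustration, pagination around a format string like that might look like this (the category, token, page count, and the `crawl` helper are placeholders for this sketch, not the article's actual parser):

```python
import time
import urllib.request

# Same URL shape as the parser above; token and category are placeholders.
BF_URL = "http://www.buzzfeed.com/%s?p=%s&z=%s&r=1"

def page_urls(category, token, pages):
    """Build the per-page URLs by incrementing the p= parameter."""
    return [BF_URL % (category, p, token) for p in range(1, pages + 1)]

def crawl(category, token, pages, delay=1.1):
    """Fetch each page with a fixed delay between requests (not run here)."""
    for url in page_urls(category, token, pages):
        yield urllib.request.urlopen(url).read()
        time.sleep(delay)
```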

------
Axsuul
Or perhaps they block the Older button because only articles before page 10
are cached.

~~~
KMag
I would hope that any moderately popular site would implement popularity-based
caching. Popularity counters with half-lives are pretty easy to implement.
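A minimal sketch of such a counter (the lazy-decay approach here is one common implementation choice, not a claim about how any particular site does it):

```python
class DecayingCounter:
    """Popularity counter whose value halves every `halflife` seconds.

    Rather than decaying every counter on a timer, store the value plus
    its last-update time and apply the decay lazily on read or bump.
    """

    def __init__(self, halflife):
        self.halflife = halflife
        self.value = 0.0
        self.last = 0.0  # timestamp of the last update

    def _decayed(self, now):
        # Exponential decay: value * 0.5^(elapsed / halflife).
        return self.value * 0.5 ** ((now - self.last) / self.halflife)

    def hit(self, now, weight=1.0):
        self.value = self._decayed(now) + weight
        self.last = now

    def read(self, now):
        return self._decayed(now)
```

Caching would then favor pages whose decayed counter stays above some threshold.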

------
bitJericho
The author tries desperately to find logic in BuzzFeed, when really it's about
as bad a website, in both content and design, as it gets.

~~~
minimaxir
I'm not a fan of BuzzFeed either, as evidenced by the sarcasm in the article.
But I can't deny that it _works_, and their engineering and data teams are
top-notch.

~~~
Igglyboo
First off, you don't actually have any evidence that it's greyed out to stop
web scrapers. And secondly, it doesn't really work against anything but the
bare-minimum worst scrapers, because you can easily just change the URL
instead of clicking the actual button.

~~~
minimaxir
That's the exact point I'm making. I cannot determine any logical reason for
BuzzFeed to disable it other than web scraping, which _in itself_ makes no
sense because it's easy to work around.

~~~
wmil
I'm guessing an aggressive new manager came in and demanded it, and he was the
type who viewed counterarguments as disrespectful to his authority.

------
bjterry
It's interesting that there is an apparent discontinuity in views between 9
list items and 10 list items in the last graph.

~~~
minimaxir
There's a reason for that.

The behavior for listicle size = 9 entries is not a mistake. BuzzFeed, for
some reason, always set their redundant Summary listicle article to 9 entries.
[https://docs.google.com/spreadsheets/d/1w0GKyDvK9KWaOgjDJxV2...](https://docs.google.com/spreadsheets/d/1w0GKyDvK9KWaOgjDJxV2s4K3pf2pMCxqclkjU3pp6jo/edit?usp=sharing)

------
twelve40
Maybe that particular stupid type of bot was causing problems for them.

------
plumeria
Does Cloudflare work for preventing scrapers?

------
notastartup
Incrementing the numeric value in the URL makes web scraping so much more
reliable.

Reliability and maintenance are a trade-off in scraping. The big pain comes
when websites change designs: imagine BuzzFeed a few quarters from now; it
will surely change some parts, requiring you to relabel the data fields.

[https://scrape.it](https://scrape.it) automatically adapts to page layout
changes and continues scraping without interruption. It requires no
maintenance work from the user's point of view.

The other thing I'd like to see Scrape.it do is scrape by incrementing the
page URL. Unlike Kimono, Scrape.it is designed primarily to work on dynamic
web pages (sites requiring an AJAX POST to increment the page), but it's quite
amazing how many sites out there simply use the page number in their URLs.
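For contrast, a site that paginates via an AJAX POST needs a request body instead of a URL tweak. A hedged sketch (the endpoint and the `page` field name are hypothetical, not any real site's API; the request is built but not sent):

```python
import urllib.parse
import urllib.request

def next_page_request(endpoint, page):
    """Build the POST an AJAX-paginated site might expect.

    The endpoint URL and 'page' field name are made up for illustration;
    a real site would be inspected with browser dev tools first.
    """
    body = urllib.parse.urlencode({"page": page}).encode("ascii")
    return urllib.request.Request(
        endpoint,
        data=body,  # supplying data makes urllib issue a POST
        headers={"X-Requested-With": "XMLHttpRequest"},
    )

req = next_page_request("http://example.com/feed", 2)
```

With URL-based pagination the whole loop is a counter; with POST-based pagination you have to replicate the request body, which is why the simple scheme is so convenient for scrapers.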

