
The Scraping Problem and Ethics - codezero
http://blog.osvdb.org/2014/05/07/the-scraping-problem-and-ethics/
======
ChuckMcM
This is one of the more interesting policy questions on the web. Our search
engine crawls a lot of blogs and what not on the web, criminals who want to
find unpatched wordpress sites try to scrape our crawl by sending automated
(scripted) queries to find them. We have developed a number of defenses over
the years and pretty regularly ban them[1]. Here is the weird part though, if
they hired 300 people on mechanical turk and paid them each a dollar to do one
search, it probably would get them more information. Look at folks like 80legs
or other 'distributed' scrapers. They exist almost solely to subvert these
service terms. Are they evil? Creative?

One of the things that stands out is called out in this article. The people
involved really want this information, so much that they are willing to expend
time and effort to construct scraping bots and what have you. Why not just
buy it? How is it that someone gets a request from their boss to get some
information, but their boss expects them to get it for free? Can you imagine
if they said, "We need pens, pencils, notebooks, staplers, the works for the
office here. Oh and you can't spend any money getting that stuff, just get it
here." Would they construct some elaborate raid on a nearby Office supplies
store using a mercenary army of criminals? Why do that with information?

We did an experiment where we would 'grep the web' for you, basically run a
regex over a multi-billion page crawl, give you the first 50 results for
"free" and you could buy the complete set. I think we sold exactly one of
those.

It is a weird thing, the OP captured it perfectly.

[1] It's a violation of the terms of service.

~~~
drewcrawford
I think there is a great meta-question in here, about business models for
digital data and software.

Here you have a great case study, about an organization that tried to do a
volunteer model, and it didn't work. Then they pivoted to a commercial model,
but fundamentally they still believe in a free tier. But they have to cripple
that free tier pretty thoroughly, and even _still_ people abuse it.

I have a product I'm working on, that some people are apparently willing to
spend lots of money on. Ideally I would have some kind of low tier, so that
people without lots of money would be able to use it too. But I can't figure
out a way to segment the product so that everybody pays what they can afford
without bad apples abusing the low tier and ruining it for everybody. The
result is that I may end up only selling it to customers with deep pockets,
even though the product is much more broadly applicable.

~~~
MetaCosm
They made the process of paying for the software a laborious pain in the ass.
They are desperately trying to extract money from those who can pay, which
sadly drives away those who can pay but don't want an involved process.

I have worked at lots of companies where I had a monthly budget of 10k+ that I
could spend on whatever I wanted, but if I wanted any sort of complex deal
(can't just put on CC with a line item) -- had to bring in legal and other
groups -- instantly killed any interest.

"Licensing is based on the data needed (e.g. all of it vs subset), how it is
used (e.g. internal only, external, product integration), etc."

What a goddamn horror show. I simply want a product, I want to pay for it, and
I want to use it. Turning on Dropbox for Business was a decision made in about
5 minutes... "You all like it, already using it, awesome! I will get the team
set up." -- 5 minutes later I had given Dropbox $3800.

I really think they are getting in their own way for no benefit. They have
created a very high barrier to EVEN HAVING A DISCUSSION about buying the
product. So, if I don't know exactly how I will use it -- I can't purchase it.
Stupidity.

~~~
notastartup
Publishes data in public. Can't get people to pay for it. Blames people for
theft. The real thieves are the ones separating people from their wallets over
data that is publicly available and withholding it from those who won't pay.

------
alexnking
I'm more and more concerned that the legal and cultural environment for web
scraping would make it hard for a company like Google or Yahoo to be founded
today.

The internet isn't about "don't take my stuff", it's about spreading that
stuff around. I'm confused by people who want to make their data public, but
want to control exactly how people access it.

~~~
hnal943
_I'm confused by people who want to make their data public, but want to
control exactly how people access it._

You mean like TV? or Radio? or print....

~~~
johnward
To be fair those mediums all kind of suck in their own way.

------
duey
The OSVDB website contains no signup page for commercial access. No pricing
either, purely sign up via contacting someone. From my experience whenever I
see this, I just refuse to use the service and look elsewhere. Contacting
someone is annoying and opens you up to repeat sales calls. Perhaps they
should make commercial API access easier to access rather than complain about
scrapers.

~~~
lukejduncan
It's not really reasonable to say "I don't like the way you market your
goods... so you really shouldn't be concerned with people stealing them."

~~~
duey
Not in the long term obviously, but you can't make a product extremely
difficult to buy and then complain that everyone is taking the easy route.

~~~
danielweber
Talking to someone on the phone is not "extremely difficult." Large companies
buy stuff by talking to people on the phone all the time.

~~~
the_ancient
and I have a policy that if a company does not practice open pricing by
publishing their prices online, I will not do business with them.

Fuck that nonsense of pricing based on whatever they think they can scam me
out of.

~~~
patio11
Apropos of nothing: customers often have an exaggerated notion of how
important it is to e.g. an enterprise software company that that company land
their account.

A conversation I've had a few times:

"We need it to do $THING_IT_WON'T_DO."

"In that case, it probably isn't a great fit for your needs."

"You don't understand. I won't buy it if it doesn't do that."

"I think I do understand. That's fine. You might consider trying $COMPETITOR,
although you should know their minimum spend is $1,000 a month."

"That's outrageous. You have a $29 plan."

"Yes. So you should go with the competitor if that requirement is worth $971 a
month to you."

"No, I want to spend $29, but I absolutely need that."

"I understand where you're coming from, but we do not offer that feature, and
if we did, we would charge prices close to what our competitor does for it."

"You're not working with me here."

"I'm trying to find a resolution which works for you, but including that
feature at $29 doesn't make business sense for me, so I won't do it."

"Put me on the phone with your boss."

"I'm afraid that isn't possible, as I sort of run things around here."

"What sort of businessman turns customers away."

"You're not a customer. If you were, you would be purchasing a product I sell
for the amount I sell it for. That isn't happening. That's fine. Have a nice
day."

------
DigitalSea
If I were in OSVDB's shoes, I would call these people out in an email and ask
them to pay a licensing fee. McAfee have always had a shady past, even when
they shook themselves clean of John, they have a history of scam-like
behaviour to make a quick-buck.

You should be rate-limiting how many requests free API users can make, like
Twitter, Facebook and every other Internet provider does via their API. Make
it harder for people to obtain the information (to the best of your abilities)
and paying will become more of an option because they won't go to the trouble
of scraping it if it's impossible and will take a lot of time to do so.
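The rate limiting DigitalSea suggests is cheap to implement. Here's a rough sketch of a token bucket keyed per client (a hypothetical standalone version, not anything OSVDB actually runs; the capacity and refill numbers are made up):

```python
import time


class TokenBucket:
    """Per-client token bucket: each client starts with `capacity` request
    tokens, which refill continuously at `rate` tokens per second."""

    def __init__(self, capacity, rate):
        self.capacity = capacity
        self.rate = rate
        self.buckets = {}  # client id -> (tokens remaining, last seen time)

    def allow(self, client_id, now=None):
        """Return True if this client may make a request right now."""
        now = time.monotonic() if now is None else now
        tokens, last = self.buckets.get(client_id, (self.capacity, now))
        # Refill based on elapsed time, capped at the bucket capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self.buckets[client_id] = (tokens - 1.0, now)
            return True
        self.buckets[client_id] = (tokens, now)
        return False
```

Key it on the API key for registered users and the IP for anonymous ones, and a casual bulk scraper slows to a crawl while normal browsing never notices.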

Think of your offering as a car. You currently have no car alarm or
immobiliser; if you install an immobiliser and a car alarm, you will make it very
hard for a thief to steal your car.

~~~
userbinator
On the other hand, large (even non-profit) organisations are precisely the
ones who have the resources to scrape stealthily and widely, as would a
loosely-organised community of users... it's not hard to come up with
algorithms to respect the rate limits, balancing the load across multiple IPs
and accounts, and producing access patterns that don't look any different from
the rest of the site traffic.

~~~
rtpg
like much security, it's not about making it impossible, it's about making it
a lot less convenient/a bit harder.

At some point the effort to circumvent costs more in man-hours than just
buying the product.

~~~
Mikushi
You can get scraping libraries fairly easily. In my more shady past I
developed and shared a library like that: an HTTP client library with
automatic proxy rotation that kept itself under rate limits.

When I used it (which was almost a decade ago) I never ran into problems; plug
in a list of 10,000 proxies and scrape away.

Not condoning that, which is a bit hypocritical of me; at the time I was mostly
doing what I was told and I thought I was clever. Now that I'm in a position
to have a positive impact, I do buy data and pay appropriate licence fees on
all software/data purchase, which still baffles some of my programmers who
constantly ask "why not crack it?", "you know I found a .zip on Google with
the data, why buy it?", and so forth.

I don't know what in programmer culture makes it so hard for us to pay for
something, some people put some effort behind that software / data collection,
and it's only fair to pay them.

~~~
test1235
Speaking as devil's advocate, it might just be more convenient to steal the
data.

Maybe I want to use your data casually once, and I don't want to sign up and
give you all my contact details and subscribe to your annual plan with all the
other optional extras.

Tough shit, you say? I'll just steal it then, and not because I can't afford
it, but because you're making it hard to pay.

~~~
notastartup
but how is it stealing when one does not lose inventory? If one person scrapes
a page, did you lose the source code? Does it not become available for the
next visitor? What possible loss do you incur that is directly tied to your
data? When you make data public with the intent of being readily accessible by
the public, how can you claim theft when you are achieving what you set out to
do? Does the accelerated rate of access suddenly become a theft? Does one need
to pay a third party to avoid the pain associated with manual hand labor for
simply hosting the data which is available to the public? Help me understand.

~~~
jerf
"but how is it stealing when one does not lose inventory?"

Scraping is not necessarily a no-victim situation. Even today after this stuff
has gotten cheaper, you're costing them bandwidth fees, and likely increasing
their server storage and CPU fees if it's on a metered hosting service, which
is quite likely nowadays. If you degrade their site's functionality, you may
chase away paying customers.

We need not hypothesize crazy third-order effects; you are taking money out of
their pockets by the act of scraping itself, independent of the question of
the value of the content.

"What about Google? etc." \- robots.txt-honoring scrapers that don't hammer
the sites at least have a plausible claim to permission. Scrapers are quite
likely to be ignoring the robots.txt.

~~~
tripzilch
> Scraping is not necessarily a no-victim situation. Even today after this
> stuff has gotten cheaper, you're costing them bandwidth fees, and likely
> increasing their server storage and CPU fees if it's on a metered hosting
> service, which is quite likely nowadays. If you degrade their site's
> functionality, you may chase away paying customers.

While technically correct, you are conflating the issues, because in none of
the cases (that I've seen mentioned so far in this thread) is the problem with
bandwidth/storage/CPU costs of _retrieval_ to any significant extent.

Instead, it appears that almost all of the costs are incurred _before_
retrieval: curating, sorting, etc.

I'm not arguing that it's okay, but it's no more stealing / thievery than
downloading movies or music is.

------
cproctor
I can't help recalling a post here a couple of years ago about the concept of
"hellbanning" scammers on ecommerce sites--in short, making it look like
everything is going fine, while actually isolating them completely from your
business logic. Orders with stolen cards appear to go through, and send
confirmation emails, but no real order is generated... In this case, you could
transparently poison the results served to identified scrapers with 5% bogus
vulnerabilities.

Or is that about as ethical as spiking trees to prevent illegal logging?
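For what it's worth, the poisoning idea fits in a few lines. Everything below is hypothetical (the entry format, the 5% figure, and the flagging mechanism are all invented for illustration); the one real trick is hashing the client id together with each record id, so a flagged scraper sees the _same_ fakes on every fetch and can't wash them out by diffing repeated requests:

```python
import hashlib


def poison_results(entries, client_id, fraction=0.05):
    """For a flagged scraper, deterministically replace ~`fraction` of
    entries with bogus ones. Each entry is a dict with at least an 'id'."""
    poisoned = []
    for entry in entries:
        # Stable per (client, entry) coin flip via a hash, not random().
        digest = hashlib.sha256(f"{client_id}:{entry['id']}".encode()).digest()
        if digest[0] / 256 < fraction:
            poisoned.append({"id": entry["id"],
                             "title": "Buffer overflow in legacy component",
                             "bogus": True})
        else:
            poisoned.append(entry)
    return poisoned
```

Ordinary users (never flagged) get untouched data; anyone reselling the flagged feed ships a quietly broken product.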

~~~
freshhawk
Heh, I've always called this the map makers' trick (found an article about it
here: [https://theweek.com/article/index/241967/trap-streets-the-crafty-trick-mapmakers-use-to-fight-plagiarism](https://theweek.com/article/index/241967/trap-streets-the-crafty-trick-mapmakers-use-to-fight-plagiarism)),
although I guess that is specific to putting a small amount of fake data in
your dataset to prove someone else used it. The hellbanning metaphor does fit
for returning large amounts of poisoned results.

It could be like spiking trees I guess, but that depends on the potential for
harm. I just looked it up and was surprised to find that only one injury has
ever been reported due to tree spiking. I guess it's a better talking point
than actual tactic.

And like lukejduncan said, this is definitely done in practice.
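A minimal sketch of that trap-street variant, with a made-up key and naming scheme: derive the fake record ids from an HMAC so only the publisher can regenerate them, which is what makes finding one in a competitor's dump count as proof:

```python
import hashlib
import hmac

SECRET = b"rotate-me"  # hypothetical publisher-only signing key


def make_canary(dataset_name, n):
    """Derive n fake-but-plausible record ids from an HMAC of the
    dataset name. Deterministic, so they can be recreated later."""
    mac = hmac.new(SECRET, dataset_name.encode(), hashlib.sha256).hexdigest()
    return [f"CANARY-{mac[i * 8:(i + 1) * 8]}" for i in range(n)]


def proves_copying(suspect_ids, dataset_name, n=4):
    """True if any canary id appears in someone else's dataset."""
    suspects = set(suspect_ids)
    return any(c in suspects for c in make_canary(dataset_name, n))
```

Seed the canaries into the published data, keep the key private, and the proof survives even if the copier reformats everything else.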

~~~
calroc
The point of spiking trees isn't to hurt people, in fact the tree-huggers
_tell_ the loggers that the trees have been spiked so that they won't attempt
to cut the trees. The point of the tactic is to prevent damage to the forest,
_not_ to actually hurt anyone.

------
jmzbond
This also goes way beyond scraping, in the sense of small start-ups doing all
kinds of hacks or clever manipulations to stay within the bounds of a free trial
service.

As someone who sees both sides… I don't know what to say. I'm running test
landing pages on Heroku with New Relic that pings the sites every minute to
ensure the dynos keep spinning and my users don't experience downtime. While
I'm careful to stay within fair use, this is at best obnoxious, because if
everyone did this, Heroku would certainly need to redefine what's free. From
my POV though, I am a bootstrapped entrepreneur and supporting 5 landing
pages. I simply don't have resources to pay for a dyno and test everything I
have in my head, especially not combined with the many other resources I'd
need to start paying for as well.

Or consider the kid in Florida who used Parse's free account for hundreds of
thousands of users. [1] (The article was on HN a few weeks ago; this was not
its central point, just something I took away relevant to this comment.)

Part of the cause I think is that we live in a world where we're so used to
having things be free, it becomes an entitlement. Another is that all these
examples of start-up hacks and hustle stories, we kind of laud, don't we?
Everyone talks about how Airbnb scraped Craigslist and got a huge boon that
way, but few speak of it in critical tones. Should we? Or is that how competition and new
products get created (i.e., if the scraping hadn't happened, perhaps Airbnb
and the whole sharing economy would be less successful today).

These are philosophical questions, and I don't really have a solution, but
they are things to think about.

[1] [http://pando.com/2014/04/30/how-a-florida-kids-stupid-app-saved-his-familys-home-and-landed-him-on-the-main-stage-of-facebooks-f8/](http://pando.com/2014/04/30/how-a-florida-kids-stupid-app-saved-his-familys-home-and-landed-him-on-the-main-stage-of-facebooks-f8/)

------
im3w1l
I was once developing a program for automated trading for my own personal use.
In the beginning I made periodical requests to my broker, scraped out the
price info, and made orders if the price was right. Once I had something sort
of working I decided that I wanted to do this the right way and ask them for
permission. They told me I would have to subscribe to a feed through some
other program. After many hours of reading the manual and trying to figure out
how to transfer the data to my application I again got it sort of working. But
I had already started to lose interest and other things had come up in my
life.

How is this relevant? Well, I would have much preferred paying for scraping
over having to learn some new API. It increased the transaction cost. If you
further have to negotiate with a partner company, that sounds like even more
transaction cost, from having to send emails back and forth and the mental
effort of negotiating.

~~~
wslh
Beyond the moral discussion, I think a market for web scraping is a good thing
because currently there are a lot of unconnected people trying to buy/sell
this service.

Freelancer sites have a lot of offerings for web scraping, but this niche has
its own issues.

------
lugg
Pretty sad. McAfee I don't really expect any better from. They've always been
a bunch of scam artists / cons in my mind.

I don't have time to look, but a couple of things struck me as odd. Is there a
reason you don't lock down your request limits? Also, why don't you secure
access? Allow free accounts to be made and issue API keys for them; at least
that way you can much more easily rate limit access to the API and heavily
rate limit front-end web requests down to reasonable numbers.

------
yaur
Shouldn't you at least be disallowing /show/* in robots.txt? Not that scrapers
are necessarily going to respect this... but the way you're set up it seems like
this is semi-legit behavior.

~~~
userbinator
Maybe they want Google to be able to crawl their database (which it has
clearly done, as you'll see if you do a search.) That also raises some
questions...

~~~
yhager
Not completely by the specification, but I think this one works as expected.

    
    
        user-agent: *
        disallow: /
    
        user-agent: Googlebot
        allow: /

~~~
yaur
I think userbinator's point is that they DO conditionally allow scraping,
which makes their position even more tenuous IMO.

~~~
thefreeman
By a single search engine which probably provides the _vast_ majority of their
traffic.

While I don't necessarily agree with the concept of only allowing google to
index your site, comparing a search engine which feeds you business to a
company reselling your data with no attribution is not really fair in my
opinion.

~~~
yaur
Neither of the two parties mentioned is likely to resell the data directly;
both are likely to create a derivative work which they will exploit
commercially, as will Google. What part of the ToS makes one OK and the
other forbidden? How does this work when a smart scraper can just pull from
the Google or archive.org cache?

------
kylemaxwell
Web scraping isn't a crime - the simple act of downloading the data should not
be a problem here. (The reuse of the data might be, depending, but we don't
have that information right now.)

This doesn't even rise to the level of what Weev did.

~~~
stronglikedan
He's just making the point that it's unethical, which it is. Even if it boils
down to simple bandwidth theft.

~~~
__david__
I feel the opposite. I don't think it's unethical, even if it might be
illegal.

Even calling this "bandwidth theft" is quite the hyperbole—if the server can't
handle the bandwidth, then rate limit the requests.

I think if you're serving out pages to the public, you don't really get to
tell me what kind of browser I'm allowed to download it with. As long as I'm
speaking HTTP, it seems fair.

Sadly, that law has been slowly creeping against this mentality... Lately I
feel like I'm some old internet hippy with these views. On a site called
"Hacker News", no less.

~~~
ibmthrowaway218
> Sadly, that law has been slowly creeping against this mentality... Lately I
> feel like I'm some old internet hippy with these views. On a site called
> "Hacker News", no less.

I guess it's because everyday more and more people on here are finding
themselves on the other side of the fence, i.e. finding that some of their
users are ripping off their content/site.

~~~
ForHackernews
Maybe people need to come up with a better business model for websites than
"Put up some 'content' and sell ads next to it."

Sell a product or service.

------
msie
Why not put all information behind a login page and force people to sign up?
It looks like this site will be used by only a few people anyways. You can
then also track scrapers by login.

------
jwcrux
As a somewhat unrelated side note, I'm actually working on providing a free
API written in Golang for CVE and CVSS entries. You can find it here:
[https://github.com/jordan-wright/cve-api](https://github.com/jordan-wright/cve-api)

------
elwell
Is it ethical to scrape from a scraper?

~~~
andreasvc
Being ethical at least requires respecting robots.txt.

~~~
elwell
If robots.txt said to jump off a bridge, would your spider do it?

------
pbreit
It's hard to get excited about this. The data owner supposedly has a mission
to make the data available but then is concerned with such a thing. It doesn't
strike me as that difficult or expensive to make a database of that size
available.

------
e28eta
We had a former Google Maps PM give a talk about building APIs, and he
addressed scraping. Something he pointed out is that the traffic pattern from
scrapers differs from regular users. Regular users make requests for the same
things often. Scrapers usually only request any given item once. The really
obvious ones do it in order from the beginning. He had a hilarious map showing
the requests of someone requesting business information starting from the top
left corner of the map. They got really good information on all of the
businesses between the North Pole and Greenland before being blocked.
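That repeat-vs-unique signal is easy to turn into code. A toy heuristic along those lines (thresholds invented for illustration, not anything from the talk):

```python
def looks_like_scraper(request_ids, min_requests=50):
    """Flag a client whose request stream looks like a crawl rather than
    browsing: almost no repeated items, or ids in near-perfect order."""
    if len(request_ids) < min_requests:
        return False  # not enough traffic to judge
    # Real users re-request popular items; scrapers fetch each item once.
    unique_ratio = len(set(request_ids)) / len(request_ids)
    # The obvious scrapers walk the id space from the beginning.
    ascending = sum(b > a for a, b in zip(request_ids, request_ids[1:]))
    sorted_ratio = ascending / (len(request_ids) - 1)
    return unique_ratio > 0.95 or sorted_ratio > 0.95
```

The North-Pole-first crawler in the anecdote would trip both tests at once, which is exactly why it was so easy to block.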

------
kevcampb
I see that the "Open Source" Vulnerability Database changed its name to the
"Open Sourced" Vulnerability Database in July 2013.

[https://web.archive.org/web/20130714002216/http://www.osvdb....](https://web.archive.org/web/20130714002216/http://www.osvdb.org/osvdb_license)

I guess they were hoping no one would notice the subtle but substantial change
to their service.

------
binarysolo
Given these people's immediate reaction to resort to scraping, I'm honestly
kinda surprised they didn't set up basic practices of at least going through
proxies (or free ride off of Tor, ugh).

------
NoMoreNicksLeft
Maybe it's the supervillain in me, but I think I would try to come up with
software to recognize the scraping attempts, and rather than ban them just
have it generate fake data on the fly.

------
im3w1l
>including an expansive watch list capability,

I would avoid the word expansive. Too similar to expensive. Especially when
emailing people who are not native English speakers.

------
Gigablah
Looks like they forgot to conceal the full name and email of the guy from
s21sec in one of the quoted emails...

------
bitJericho
I would hope they sent a bill.

------
detroitcoder
Amateurs. If you are going to crawl distribute it across unique IP addresses.

------
diziet
What is the price? I assume the price is in the 5-10k+/month range, which is a
reasonable amount for data like this.

------
paulhauggis
What about Aaron Swartz? He essentially did the same exact thing to the
computers at MIT and he is somehow a freedom fighter when someone doing the
same thing to the osvdb website is considered "unethical".

This is straight from the open security foundation website:

"We believe that security information and services should be easily accessible
for all who have the need for such information and services"

~~~
codezero
While your sentiment is reasonable, I think the main difference here is that
McAfee and others mentioned host their own private vuln databases and do not
share them with anyone, so they were scraping to increase their own private
resources for commercial use.

Aaron was scraping private resources to share publicly.

~~~
jamra
More than that, he was scraping private resources that were freely populated.
He was not robbing content creators of their money. He was circumventing a
paywall to what should be free data.

~~~
npizzolato
That's quite the mental leap. He was circumventing a paywall, but that's not
robbing anyone of money because it should have been free in the first place?
Well, it wasn't free, even if you think it should be. Thus the paywall.

~~~
codezero
To that point, the people who submitted to the journal did so knowing that it
costs money to access, it wasn't as if they were tricked into contributing to
a private pay-wall journal.

~~~
Blahah
For most scientists there isn't another system, and certainly when the system
was established nobody was envisaging a time when publishers would make
hundreds of percent markup on every access of a paper. The prices reflect print
publishing costs and the historical lack of a cheap distribution system like
the internet. Now we have a hangover where people's careers are judged on
their ability to publish in high-impact journals, all of which charge several
thousand pounds extra to make the paper Open Access. So we're not tricked, but
we do it under duress.

The system is changing, Aaron helped.

------
notastartup
Scraping is not an ethics discussion. It's what you do with the data that
falls into the topic. Selling email lists scraped from websites to spammers
would be one. Using email lists to prevent spam would not be one.

The article is basically saying "I want to charge people for information I
have made public online, they won't pay, so they are obviously thieves by
refusing to do it manually by hand like they are supposed to."

Gimme a fucking break here. If you don't want the information to disseminate,
DO NOT PUBLISH IT ONLINE.

