
Scraping Isn't Hacking - danso
https://www.getrevue.co/profile/themarkup/issues/scraping-isn-t-hacking-261438
======
superkuh
Well, the article is right. But what happens in practice in cases like these
is that everything has been made illegal and it's just arbitrarily enforced.
Any behavior that grandpa judge can't understand, say, anything beyond using a
browser (wget, for instance), counts as using a hacking tool, and as long as
the person you annoyed has deep pockets or connections, you'll go to prison.

The problem isn't with the details of any particular scraping or its subtlety. The
problem is with the law, CFAA 1030, and how it can be interpreted however the
particular DA wants. The judges are too ignorant of technology to stop them.

~~~
ghaff
>All behavior that grandpa judge can't understand,

Stop with the ageist crap.

~~~
cameronbrown
I don't think it's about ageism, but about the pace of technology. Many (if
not all?) of the concepts being debated were likely invented after most
current judges were born: the internet; HTTP; web scraping; browsers. It's not
easy to understand something you've not spent much time learning about. And
the rule book on laws regarding the internet is not fully written yet.

~~~
kerng
It also goes the other way round.

~~~
cameronbrown
I totally agree with you on this also.

------
scarygliders
Scraping isn't hacking, sure.

But I've witnessed some particularly aggressive scrapers bring a shopping site
to its knees.

I've watched as multiple scrapers were launched at this site from different
blocks of IP addresses. I spent some hours blocking these.

After spending even more time, I tracked down these IP blocks - from multiple
countries - back to the one scraping company, who had obviously been hired by
a competitor to scrape the entire site for its prices for the ranges of goods
sold.

I consider such aggressive scraping to be a DDoS attack.

~~~
andyfleming
It’s more like service slaughter (which isn’t a term AFAIK) than denial of
service (since denial isn’t the goal).

It’s a tough problem though. In an ideal world, I think scrapers should pay
for the compute/resources required to serve their requests. I think that’s one
of the only ways to scale for this fairly. (This assumes you agree that
scraping of public content should be allowed in the first place.)

~~~
jka
That's an interesting idea. Sometimes a lot of the (service provision) cost
depends on how efficient the data access is at the host.

For example, let's take an API that includes a 'GET /data?start=0&end=100'
endpoint that retrieves items with key between 0 and 100.

If items are stored in sorted key order, that could be an efficient and
inexpensive query that retrieves exactly the 100 relevant items and doesn't
perform any other unnecessary work.

If, on the other hand, the database contains a billion items and they aren't
stored in key order, then with a caller-pays pricing model, it's (probably,
for the sake of argument) not worth running that query.

In some cases it might be reasonable to continue to pay for the expensive
queries. But what if all you need to do is convince the host to 'CREATE INDEX
... ON (key)'?

(all very hypothetical - I don't have any explicit answers. if the host
service were open source, the answer might be 'open a pull request and talk to
them, and hopefully reduce everyone's costs')

~~~
ethanwillis
If I was evil and the law was that scrapers had access BUT they had to pay
their fair share of the cost. I would just make their costs so absurdly high
that it wouldn't be beneficial to them.

~~~
andyfleming
Yeah, you’d almost need some way to generically price this. You’d almost want
websites to opt into it. You can either accept the $x/request price or not
participate in the program.

Though, I guess at that point you might as well just sell API access.

------
paulryanrogers
Hacking as in a crime. For some reason I read this as people who scrape
websites aren't real 'hackers', in the no True Scotsman sense.

~~~
caymanjim
The term "hacking" long ago shifted to mean "illegal activity" unless
otherwise specified or obvious from the context. In the context of this title
it's clear what was meant. I'm an ancient, pedantic nerd, and even I don't
cling to the archaic 1970s meaning of "hacking".

~~~
ianhorn
Was "hacking" as a lay person term ever the same as "hacking" as used by
makers? If it was a shift, then yeah you're right. If it was a demonization as
the public learned the term at all, then I'd say there's more reason to
continue pushing to protect the term as a part of our community.

~~~
caymanjim
Until about the early 1980s, it exclusively referred to benign or prankish
tinkering. It started to drift increasingly toward the prankish end, and then
to the nefarious. Most early computer hacking—in the illegal sense—was also
fairly benign (I wardialed my area and broke into the local supermarket
computers to teach myself Unix and C). From the 90s on, its primary use
(especially in the mainstream) has referred to illegal activity, whether gray-
or black-hat. There was a concerted effort in the 80s and early 90s to use
"cracking" for the illegal activities, but it never caught on.

------
goldenkey
It's funny how scraping can be illegal yet Google and Microsoft and Yahoo and
DuckDuckGo, and every other search engine have a monopoly on it. It's legal
for them, because .....? Money money money! Google even hotlinks images for
Google images... pretty hilarious stuff.

~~~
dafoex
Maybe scrapers should respect robots.txt and go away if there's a `scraperbot`
entry, the same way Google goes away if you put `googlebot` in there.
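For what it's worth, a scraper that wanted to honor this could use Python's standard-library robotparser; the `scraperbot` token and the robots.txt below are hypothetical, mirroring the suggestion above:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt that turns away self-identified scrapers
# while leaving Google's crawler alone.
robots_txt = """\
User-agent: scraperbot
Disallow: /

User-agent: googlebot
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("scraperbot", "/prices"))  # False
print(rp.can_fetch("googlebot", "/prices"))   # True
```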

------
jjice
Does anyone have tips for polite scraping? Just making sure not to hurt a site
(or get IP blocked)?

~~~
andyfleming
This is a tough problem because if you asked most sites/admins if you could
scrape their site, they don’t really have much motivation to say yes. For some
it feels like you are stealing their hard work. For others they don’t want to
pay for the requests your scraper will make to their site, etc.

Regardless, in an ideal world, some polite things would be to:

- ask for permission before scraping, explaining how it’s neutral or positive
for their business (so they are more likely to support your continued scraping
or even potentially provide a more cost-effective format/API for you)

- scrape at a reasonable pace

- scrape off hours

- make requests with a clear user agent so they know where the requests are
coming from
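The last two points can be sketched in a few lines of Python; the bot name, contact URL, and delay below are placeholders, not recommendations:

```python
import time
import urllib.request

# Hypothetical identity: say who you are and where to reach you.
USER_AGENT = "examplebot/1.0 (+https://example.com/bot-info)"
MIN_DELAY = 2.0  # seconds between requests; tune to the site's capacity

_last_request = 0.0

def throttle() -> None:
    """Sleep just enough so consecutive requests are MIN_DELAY apart."""
    global _last_request
    wait = MIN_DELAY - (time.monotonic() - _last_request)
    if wait > 0:
        time.sleep(wait)
    _last_request = time.monotonic()

def polite_fetch(url: str) -> bytes:
    """Rate-limited GET with an honest User-Agent header."""
    throttle()
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

A real scraper would also want retries with backoff and a robots.txt check, but rate limiting plus an identifiable user agent covers the basics of not hurting the site.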

------
danbmil99
This case
[https://en.m.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn](https://en.m.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn)

Is relevant here. LinkedIn is invoking the Computer Fraud and Abuse Act for
scraping a website without logging in.

While hiQ won a preliminary injunction, the case is ongoing and LinkedIn /
Microsoft has asked the Supreme Court to weigh in.

~~~
patagurbon
Surely Microsoft would want Bing to have access to most websites right? Bing
probably doesn't bring in the same revenue though...

~~~
Google234
It’s in those websites’ interest to allow Bing to index them. Bing respects
robots.txt.

~~~
artembugara
are you sure about respecting robots.txt?

~~~
Google234
Yes they do. Does newscatcherapi respect them?

------
dafoex
I might be thinking about this the wrong way, but, scrapers as described by
this article are more-or-less targeted web crawlers, right? Don't we have a
de-facto standard for handling this?

    User-agent: scraperbot
    Disallow: /

~~~
FastedCoyote
It's trivial to change the User Agent.

~~~
dafoex
It is, but what I'm suggesting is that a scraper identifies itself as such in
the user agent and complies with the robots exclusion standard. Google's
crawlers recognise that `googlebot` - among other strings - refers to them,
and a scraper should behave similarly if it doesn't want to be seen as
malicious. I think scrapers should be legal, but just like crawlers, they
should comply with a directive that says they aren't welcome.

And on the topic of trivial workarounds, it's even more trivial to just ignore
the robots.txt and scrape away, but that's what malicious bots performing
questionable activities do. One would hope a newsbot is being civil and not
behaving in a shady manner, but alas, I don't for one moment think any of
these news publishers have even heard of a robots.txt

------
swyx
i'm a little confused why this article doesn't also bring up the other high
profile supreme court scraping case - the HiQ LinkedIn case
([https://www.natlawreview.com/article/hiq-files-opposition-brief-supreme-court-linkedin-cfaa-data-scraping-dispute](https://www.natlawreview.com/article/hiq-files-opposition-brief-supreme-court-linkedin-cfaa-data-scraping-dispute))

Why does the Supreme Court seem to be considering these two related cases
together? seems like duplication?

~~~
artembugara
This case is really, really specific: LinkedIn did not own the data of the
people who posted it there.

------
bernardlunn
Lousy title. A less clickbaity one would be "sometimes scraping is legal and
sometimes it is ethical". Creator of the original content here.

