I would strongly suggest looking at a guide from an actual law firm like Akin Gump [1] vs a web scraping site that provides a call to action like the one below:
>Speak to a CrawlNow data expert today to explore new opportunities for using data to fuel growth for your business.
That guide from Akin Gump is listed under the Further Reading section at the end of the article. So yeah, read that too if you can!
By the way, a call to action, especially at the bottom, doesn't make the content any less credible. Nobody works hard to create quality content on the internet for the sake of it. There is always a direct or indirect motive/promotion involved. Let the readers judge whether the content is authentic or not.
I’ve never understood why using a different user agent should make a difference. Ethically, if I can see the data in a web browser, I already have access to it and no one has any business dictating to me the programs I may use to access that data.
In my experience trying to calculate accurate click and open analytics, a distinct user agent tells whoever runs the analytics that they may want to exclude that traffic and gives them an easy way to do it.
Some bots use a user agent that looks like a normal browser, so you have to try to determine patterns based on timing of events, IP address, and other data. If the request has a custom user agent, all you have to do is exclude the datapoints that have that user agent on them.
I personally always appreciate the bots that have custom user agents.
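For what it's worth, that exclusion can be a one-liner over the event log. A minimal sketch in Python, assuming hypothetical event records with a user_agent field and a made-up list of bot tokens:

```python
# Hypothetical event records; the field names and token list are illustrative,
# not taken from any real analytics product.
BOT_UA_TOKENS = ("bot", "crawler", "spider", "curl", "python-requests")

def is_declared_bot(user_agent: str) -> bool:
    """True if the user agent openly identifies itself as automated."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in BOT_UA_TOKENS)

def filter_human_events(events):
    """Keep only click/open events whose user agent does not declare itself a bot."""
    return [e for e in events if not is_declared_bot(e.get("user_agent", ""))]

if __name__ == "__main__":
    events = [
        {"type": "open", "user_agent": "Mozilla/5.0 (Windows NT 10.0) Chrome/120.0"},
        {"type": "open", "user_agent": "ExampleFeedFetcher/1.0 (+https://example.com/bot)"},
        {"type": "click", "user_agent": "curl/8.4.0"},
    ]
    print(len(filter_human_events(events)))  # -> 1
```

A bot that spoofs a stock browser string sails straight past a check like this, which is exactly why honest user agents are appreciated.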
I don’t see why “publicly accessible data” is relevant: if I want to scrape my Facebook news feed and consume it as an RSS feed, I don’t see any justification for preventing me from doing this. The fact that I can see someone’s post in my web browser means I’m authorized to access that post.
No, it depends on what courts consider "unconscionable". We will likely never know for simple cases like changing your user agent because Facebook isn't going to sue someone over it, but they have fought (and won) against wholesale scraping of user profiles, which is mentioned in the linked article.
> Ethically, if I can see the data in a web browser, I already have access to it
Someone on the other side is offering you a service by providing that data. They may literally have a business, and it may be in their business's best interest to dictate how you access that data. They can filter your UA, they can rate limit, they can ban you, etc.
Realistically, if the way you consume and use their public data affects their revenue, they can (and should) act to protect their income.
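To make "they can filter your UA, they can rate limit" concrete, here is a minimal sketch of a per-request check a site operator might run; the token list and threshold are made up for illustration:

```python
import time
from collections import defaultdict, deque

# Illustrative policy values, not taken from any real service.
BLOCKED_UA_TOKENS = ("curl", "python-requests", "scrapy")
MAX_REQUESTS_PER_MINUTE = 60

_recent_requests = defaultdict(deque)  # client IP -> timestamps of recent requests

def allow_request(client_ip: str, user_agent: str) -> bool:
    """Decide whether to serve a request: crude user-agent filter plus a sliding-window rate limit."""
    ua = (user_agent or "").lower()
    if any(token in ua for token in BLOCKED_UA_TOKENS):
        return False  # filtered by user agent

    now = time.time()
    window = _recent_requests[client_ip]
    while window and now - window[0] > 60:
        window.popleft()          # drop timestamps older than one minute
    if len(window) >= MAX_REQUESTS_PER_MINUTE:
        return False              # rate limit exceeded
    window.append(now)
    return True
```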
They can adjust how they serve traffic to me in various ways, but I see no reason why they should be able to tell me that I can’t use an alternative program to view the data (a new browser I’m writing, etc.). If they’re worried about bots mass-scraping data, they have rate limits and other similar tools. If they’re worried about me saving a copy of the data I can see in normal usage of their site on my hard drive for later reference, they should mind their own business.
If your scraping program is fetching data at a similar rate and in a similar way to a browser, then yes, it shouldn't make any difference to the website owner. But most scraping programs are not like that. A web browser is fetching data for a human to read; that means it is not going to, for example, fetch a thousand pages from your website in just a few seconds using many parallel connections. But many scraping programs do things like that.
This seems to suggest the only thing that should matter to the website owner is the manner of retrieving data, not the data being retrieved. If it is public data, that makes sense.
I like to use sitemaps to download websites. I only use a single TCP connection and HTTP/1.1 pipelining. This can of course take several successive connections, depending on the number of URLs and the website's request limit per connection. For example, Akin Gump's sitemap has 9174 URLs and allows at least 2000 requests per connection. OTOH, I have downloaded sites with more URLs than that in a single connection. Pipelining is slower than parallel connections, more akin to downloading a large file. Some websites are faster than others.
According to the original RFCs, this is the proper netiquette. Even though the practice of using parallel connections is widespread, I have never been able to find an RFC that advocates it. Popular browsers today open dozens of connections, all in the name of advertising.
I take the absence of a sitemap as an indication that the website owner may have issues with automated data retrieval. For example, LinkedIn has no sitemap. However, most websites I encounter have sitemaps. Downloading websites without using parallel connections is not difficult. I do not need special software, only a small binary written in C to generate HTTP requests and netcat. (Neither curl nor wget nor similar clients can do HTTP/1.1 pipelining.) I do not use a user-agent header. I have a filter I wrote for chunked transfer decoding from stdin.
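True HTTP/1.1 pipelining is hard to get out of most off-the-shelf clients, but the same idea, walk the sitemap and fetch every URL over one connection instead of dozens in parallel, can be roughly approximated with Python's standard library. This sketch issues the requests sequentially on a single keep-alive connection rather than pipelining them, and the sitemap URL is only a placeholder:

```python
import http.client
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

SITEMAP_URL = "https://www.example.com/sitemap.xml"  # placeholder, not a real target

def fetch_sitemap_urls(sitemap_url: str) -> list[str]:
    """Download a sitemap and return the <loc> entries it lists (sitemap index files not handled)."""
    parts = urlsplit(sitemap_url)
    conn = http.client.HTTPSConnection(parts.netloc)
    conn.request("GET", parts.path or "/")
    body = conn.getresponse().read()
    conn.close()
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    return [loc.text.strip() for loc in ET.fromstring(body).findall(".//sm:loc", ns) if loc.text]

def download_over_single_connection(urls: list[str]) -> None:
    """Fetch every URL sequentially, reusing one keep-alive connection to the host."""
    if not urls:
        return
    conn = http.client.HTTPSConnection(urlsplit(urls[0]).netloc)
    for url in urls:
        conn.request("GET", urlsplit(url).path or "/")
        resp = conn.getresponse()
        data = resp.read()  # drain the body before reusing the connection
        print(resp.status, len(data), url)
        # If the server announces Connection: close at its per-connection request limit,
        # http.client reopens the connection on the next request.

if __name__ == "__main__":
    download_over_single_connection(fetch_sitemap_urls(SITEMAP_URL))
```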
Ethically, if someone is quite clearly using the user agent string to function as a lock restricting access to their IP, does it really matter that it’s an objectively terrible lock?
Yeah. Why does a website operator care if I use a browser to view it or a script to extract some specific bit of data? It makes absolutely no sense to me.
If you change your user agent header to avoid blocking, and the blocking is to prevent some harm, however trivial, it’s not hard to fit that into the definition of fraud: misrepresenting a fact that the hearer relied on and then was harmed by that reliance [1].
If you can twist the wording into a fraud, I can twist my words too.
I don't 'change' user agent. There is no mandated default value. I 'set' it to a value that the service accepts. I don't set it to avoid blocking, I set it to be served a response.
You seem to imply that the mere act of 'changing' a value results in fraud. Nonsense. There is no open standard of authentication or identity here, and without the service explicitly authenticating the requester, there is no fraud.
The only grey territory is if the requester denies service to others. There is a higher chance of that happening using a script or a bot, but it can equally happen if one hires a grandma to keep clicking all day. Or 1000 grandmas at a time.
If you send the Googlebot user agent, and you’re not Googlebot, that’s misrepresenting a fact. Same if you use curl and send a Firefox user agent. That’s all a fraudster needs to do: if the hearer of the misrepresentation relies on it and is harmed, then it’s fraud.
It doesn’t matter that the user agent is trivial to misrepresent. Just because it’s trivial for me to phone someone up and tell them that I’m from Windows support and there’s a hold on their social security number and they need to wire me money to help get their grandson out of jail in Nigeria doesn’t make it not fraud if someone relies on those misrepresentations and is harmed by that reliance.
You could easily use Firefox as curl with a bit of scripting, then of course it would be using Firefox's user agent by default. There's no fraud because there is no difference between a browser user agent or a custom one. They both mean the same thing.
So when Firefox says it is Chrome, do you consider that fraud? The user agent is there to help the server better support your request; it serves no other purpose. Servers blocking certain user agents are misusing the field.
Suppose you have a batch job which does web scraping. The job accepts a parameter which tells it what User-Agent value to use.
And you have a web page which launches the job. And that web page passes the User-Agent header from the incoming HTTP request as that job parameter.
You call the web page using curl to scrape a site. The batch job sends a curl User-Agent to the site, and the site blocks the request.
Next you try calling the web page using Google Chrome. This time the crawler batch job sends a Google Chrome User-Agent to the site, and the site allows the crawling to happen.
Did you commit fraud in the above scenario? And at which step was the fraud committed?
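In code, the scenario looks something like this sketch (all names and the target URL are hypothetical); the launching endpoint simply forwards whatever User-Agent it received to the job:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen

TARGET_SITE = "https://www.example.com/"  # placeholder target for the scraping job

def run_scrape_job(user_agent: str) -> int:
    """The 'batch job': fetch the target site with whatever User-Agent it was handed."""
    req = Request(TARGET_SITE, headers={"User-Agent": user_agent})
    with urlopen(req) as resp:
        return resp.status

class LaunchJobHandler(BaseHTTPRequestHandler):
    """The 'web page' that launches the job, passing along the caller's own User-Agent."""
    def do_GET(self):
        incoming_ua = self.headers.get("User-Agent", "")
        status = run_scrape_job(incoming_ua)  # curl caller -> curl UA; Chrome caller -> Chrome UA
        self.send_response(200)
        self.end_headers()
        self.wfile.write(f"scrape returned {status} using UA: {incoming_ua}\n".encode())

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), LaunchJobHandler).serve_forever()
```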
In some situations it has, but in general this isn't true, and stating it here as a general fact does not look good. If you look it up, this sort of application is actually controversial. You're more likely to be hit under some version of unauthorized use of a computer system, or something of that nature.
> A website is the property of the website’s owner.
No. For example, the information a user puts on LinkedIn is that user's property. The user put it on LinkedIn because they want the world to see it, so scraping LinkedIn to find candidates for a job doesn't violate anyone's property rights. LinkedIn might still complain about server costs, which is a valid concern, but they can't claim to own the data users themselves submitted, regardless of what their EULA says.
Treating user submitted data as property of the host just creates lock in, I don't see any reason why that would be a good policy.
I think we should be looking more at intent rather than the semantics of how web scraping can be achieved.
Whether you're setting user agent strings or taking screenshots of content doesn't really matter. What matters is what you do with the content/data.
I could build a scraper to mine data on a mass scale to stick it all in a db and instantly clear it. What are my intentions here? Learn a new skill, experiment?
One example in the comments was about phone scammers. Similar phone calls have been made in jest on radio talk shows, maybe not about scamming but impersonating famous people. What differs is the intent.
Proving intent is also difficult, as initial intent could be disguised to hide a more sinister agenda, akin to a money laundering operation. But at the root of everything will lie intent, and that's what you have to get to, regardless of the moral arguments.
[1] https://www.akingump.com/a/web/soxXRQ6Nw48FehNvwpdjJ1/2jiuhx...