I would strongly suggest looking at a guide from an actual law firm like Akin Gump [1] vs a web scraping site that provides a call to action like the one below:
>Speak to a CrawlNow data expert today to explore new opportunities for using data to fuel growth for your business.
That guide from Akin Gump is listed under the Further Reading section at the end of the article. So yeah, read that too if you can!
By the way, a call to action, especially at the bottom, doesn't make the content any less credible. Nobody works hard to create quality content on the internet for the sake of it. There is always a direct or indirect motive/promotion involved. Let the readers judge whether the content is authentic or not.
I’ve never understood why using a different user agent should make a difference. Ethically, if I can see the data in a web browser, I already have access to it and no one has any business dictating to me the programs I may use to access that data.
In my experience trying to calculate accurate click and open analytics, a distinct user agent tells whoever runs the analytics that they may want to exclude that traffic and gives them an easy way to do it.
Some bots use a user agent that looks like a normal browser, so you have to try to determine patterns based on timing of events, IP address, and other data. If the request has a custom user agent, all you have to do is exclude the datapoints that have that user agent on them.
I personally always appreciate the bots that have custom user agents.
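For what it's worth, that exclusion can be a one-liner over the event log. A minimal sketch in Python, assuming hypothetical event records with a user_agent field and a made-up list of bot tokens:

```python
# Hypothetical event records; the field names and token list are illustrative,
# not taken from any real analytics product.
BOT_UA_TOKENS = ("bot", "crawler", "spider", "curl", "python-requests")

def is_declared_bot(user_agent: str) -> bool:
    """True if the user agent openly identifies itself as automated."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in BOT_UA_TOKENS)

def filter_human_events(events):
    """Keep only click/open events whose user agent does not declare itself a bot."""
    return [e for e in events if not is_declared_bot(e.get("user_agent", ""))]

if __name__ == "__main__":
    events = [
        {"type": "open", "user_agent": "Mozilla/5.0 (Windows NT 10.0) Chrome/120.0"},
        {"type": "open", "user_agent": "ExampleFeedFetcher/1.0 (+https://example.com/bot)"},
        {"type": "click", "user_agent": "curl/8.4.0"},
    ]
    print(len(filter_human_events(events)))  # -> 1
```

A bot that spoofs a stock browser string sails straight past a check like this, which is exactly why honest user agents are appreciated.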
I don’t see why “publicly accessible data” is relevant: if I want to scrape my Facebook news feed and consume it as an RSS feed, I don’t see any justification for preventing me from doing this. The fact that I can see someone’s post in my web browser means I’m authorized to access that post.
No, it depends on what courts consider "unconscionable". We will likely never know for simple cases like changing your user agent because Facebook isn't going to sue someone over it, but they have fought (and won) against wholesale scraping of user profiles, which is mentioned in the linked article.
> Ethically, if I can see the data in a web browser, I already have access to it
Someone on the other side is offering you a service by providing that data. They may literally have a business, and it may be in their business's best interest to dictate how you access that data. They can filter your UA, they can rate limit, they can ban you, etc.
Realistically, if the way you consume and use their public data affects their revenue, they can (and should) act to protect their income.
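To make "they can filter your UA, they can rate limit" concrete, here is a minimal sketch of a per-request check a site operator might run; the token list and threshold are made up for illustration:

```python
import time
from collections import defaultdict, deque

# Illustrative policy values, not taken from any real service.
BLOCKED_UA_TOKENS = ("curl", "python-requests", "scrapy")
MAX_REQUESTS_PER_MINUTE = 60

_recent_requests = defaultdict(deque)  # client IP -> timestamps of recent requests

def allow_request(client_ip: str, user_agent: str) -> bool:
    """Decide whether to serve a request: crude user-agent filter plus a sliding-window rate limit."""
    ua = (user_agent or "").lower()
    if any(token in ua for token in BLOCKED_UA_TOKENS):
        return False  # filtered by user agent

    now = time.time()
    window = _recent_requests[client_ip]
    while window and now - window[0] > 60:
        window.popleft()          # drop timestamps older than one minute
    if len(window) >= MAX_REQUESTS_PER_MINUTE:
        return False              # rate limit exceeded
    window.append(now)
    return True
```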
They can adjust how they serve traffic to me in various ways, but I see no reason why they should be able to tell me that I can’t use an alternative program to view the data (a new browser I’m writing, etc.). If they’re worried about bots mass-scraping data, they have rate limits and other similar tools. If they’re worried about me saving a copy of the data I can see in normal usage of their site on my hard drive for later reference, they should mind their own business.
If your scraping program is fetching data at a similar rate and in a similar way to a browser, then yes, it shouldn't make any difference to the website owner. But most scraping programs are not like that. A web browser is fetching data for a human to read; that means it is not going to, for example, fetch a thousand pages from your website in just a few seconds using many parallel connections. But many scraping programs do things like that.
This seems to suggest the only thing that should matter to the website owner is the manner of retrieving data, not the data being retrieved. If it is public data, that makes sense.
I like to use sitemaps to download websites. I only use a single TCP connection and HTTP/1.1 pipelining. This can of course take several successive connections, depending on the number of URLs and the website's request limit per connection. For example, Akin Gump's sitemap has 9174 URLs and allows at least 2000 requests per connection. OTOH, I have downloaded sites with more URLs than that in a single connection. Pipelining is slower than parallel connections, more akin to downloading a large file. Some websites are faster than others.
According to the original RFCs, this is the proper netiquette. Even though the practice of using parallel connections is widespread, I have never been able to find an RFC that advocates it. Popular browsers today open dozens of connections, all in the name of advertising.
I take the absence of a sitemap as an indication that the website owner may have issues with automated data retrieval. For example, LinkedIn has no sitemap. However, most websites I encounter have sitemaps. Downloading websites without using parallel connections is not difficult. I do not need special software, only a small binary written in C to generate HTTP requests and netcat. (Neither curl nor wget nor similar clients can do HTTP/1.1 pipelining.) I do not use a user-agent header. I have a filter I wrote for chunked transfer decoding from stdin.
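True HTTP/1.1 pipelining is hard to get out of most off-the-shelf clients, but the same idea, walk the sitemap and fetch every URL over one connection instead of dozens in parallel, can be roughly approximated with Python's standard library. This sketch issues the requests sequentially on a single keep-alive connection rather than pipelining them, and the sitemap URL is only a placeholder:

```python
import http.client
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

SITEMAP_URL = "https://www.example.com/sitemap.xml"  # placeholder, not a real target

def fetch_sitemap_urls(sitemap_url: str) -> list[str]:
    """Download a sitemap and return the <loc> entries it lists (sitemap index files not handled)."""
    parts = urlsplit(sitemap_url)
    conn = http.client.HTTPSConnection(parts.netloc)
    conn.request("GET", parts.path or "/")
    body = conn.getresponse().read()
    conn.close()
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    return [loc.text.strip() for loc in ET.fromstring(body).findall(".//sm:loc", ns) if loc.text]

def download_over_single_connection(urls: list[str]) -> None:
    """Fetch every URL sequentially, reusing one keep-alive connection to the host."""
    if not urls:
        return
    conn = http.client.HTTPSConnection(urlsplit(urls[0]).netloc)
    for url in urls:
        conn.request("GET", urlsplit(url).path or "/")
        resp = conn.getresponse()
        data = resp.read()  # drain the body before reusing the connection
        print(resp.status, len(data), url)
        # If the server announces Connection: close at its per-connection request limit,
        # http.client reopens the connection on the next request.

if __name__ == "__main__":
    download_over_single_connection(fetch_sitemap_urls(SITEMAP_URL))
```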
Ethically, if someone is quite clearly using the user agent string to function as a lock restricting access to their IP, does it really matter that it’s an objectively terrible lock?
Yeah. Why does a website operator care if I use a browser to view it or a script to extract some specific bit of data? It makes absolutely no sense to me.
If you change your user agent header to avoid blocking, and the blocking is to prevent some harm, however trivial, it’s not hard to fit that into the definition of fraud: misrepresenting a fact that the hearer relied on and then was harmed by that reliance [1].
If you can twist the wording into a fraud, I can twist my words too.
I don't 'change' user agent. There is no mandated default value. I 'set' it to a value that the service accepts. I don't set it to avoid blocking, I set it to be served a response.
You seem to imply that the mere act of 'changing' a value results in fraud. Nonsense. There is no open standard of authentication or identity here, and without the service explicitly authenticating the requester, there is no fraud.
The only grey territory is if the requester denies service to others. There is a higher chance of that happening using a script or a bot, but it can equally happen if one hires a grandma to keep clicking all day. Or 1000 grandmas at a time.
If you send the Googlebot user agent, and you’re not Googlebot, that’s misrepresenting a fact. Same if you use curl and send a Firefox user agent. That’s all a fraudster needs to do: if the hearer of the misrepresentation relies on it and is harmed, then it’s fraud.
It doesn’t matter that the user agent is trivial to misrepresent. Just because it’s trivial for me to phone someone up and tell them that I’m from Windows support and there’s a hold on their social security number and they need to wire me money to help get their grandson out of jail in Nigeria doesn’t make it not fraud if someone relies on those misrepresentations and is harmed by that reliance.
You could easily use Firefox as curl with a bit of scripting, then of course it would be using Firefox's user agent by default. There's no fraud because there is no difference between a browser user agent or a custom one. They both mean the same thing.
So when Firefox says it is Chrome, do you consider that fraud? The user agent is there to help the server better support your request; it serves no other purpose. Servers blocking certain user agents are misusing the field.
Suppose you have a batch job which does web scraping. The job accepts a parameter which tells it what User-Agent value to use.
And you have a web page which launches the job. And that web page passes the User-Agent header from the incoming HTTP request as that job parameter.
You call the web page using curl to scrape a site. The batch job sends a curl User-Agent to the site, and the site blocks the request.
Next you try calling the web page using Google Chrome. This time the crawler batch job sends a Google Chrome User-Agent to the site, and the site allows the crawling to happen.
Did you commit fraud in the above scenario? And at which step was the fraud committed?
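In code, the scenario looks something like this sketch (all names and the target URL are hypothetical); the launching endpoint simply forwards whatever User-Agent it received to the job:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen

TARGET_SITE = "https://www.example.com/"  # placeholder target for the scraping job

def run_scrape_job(user_agent: str) -> int:
    """The 'batch job': fetch the target site with whatever User-Agent it was handed."""
    req = Request(TARGET_SITE, headers={"User-Agent": user_agent})
    with urlopen(req) as resp:
        return resp.status

class LaunchJobHandler(BaseHTTPRequestHandler):
    """The 'web page' that launches the job, passing along the caller's own User-Agent."""
    def do_GET(self):
        incoming_ua = self.headers.get("User-Agent", "")
        status = run_scrape_job(incoming_ua)  # curl caller -> curl UA; Chrome caller -> Chrome UA
        self.send_response(200)
        self.end_headers()
        self.wfile.write(f"scrape returned {status} using UA: {incoming_ua}\n".encode())

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), LaunchJobHandler).serve_forever()
```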
In some situations it has, but in general this isn't true, and stating it here as a general fact does not look good. If you look it up, this sort of application is actually controversial. You're more likely to be hit under some version of unauthorized use of a computer system, or something of that nature.
> A website is the property of the website’s owner.
No. For example, the information a user puts on LinkedIn is that user's property. The user put it on LinkedIn because they want the world to see it, so scraping LinkedIn to find candidates for a job doesn't violate anyone's property rights. LinkedIn might still complain about server costs, which is a valid concern, but they can't claim to own the data users themselves submitted, regardless of what their EULA says.
Treating user submitted data as property of the host just creates lock in, I don't see any reason why that would be a good policy.
I think we should be looking more at intent rather than the semantics of how web scraping can be achieved.
Whether you're setting user agent strings or taking screenshots of content doesn't really matter. What matters is what you do with the content/data.
I could build a scraper to mine data on a mass scale to stick it all in a db and instantly clear it. What are my intentions here? Learn a new skill, experiment?
One example in the comments was about phone scammers. Similar phone calls have been made in jest on radio talk shows, maybe not about scamming but impersonating famous people. What differs is the intent.
Proving intent is also difficult, as initial intent could be disguised to hide a more sinister agenda, akin to a money laundering operation. But at the root of everything will lie intent, and that's what you have to get to, regardless of the moral arguments.
[1] https://www.akingump.com/a/web/soxXRQ6Nw48FehNvwpdjJ1/2jiuhx...