
I disagree with your analogy. To me, the key word in "HTTP request" is request. A request is something that can be granted or not.



Perhaps more clearly: any HTTP request that LinkedIn believes is in violation of their terms of service will be denied. It can be hard to know, when the first request arrives, whether it comes from someone scraping the site, but once it is clear that someone is scraping, they actively deny all future requests. If they could know that an incoming request was going to be a scrape and not a page view, they would preemptively deny it.
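On the wire, that denial is just an HTTP error status. A minimal sketch in Python (the URL and the 403 check are assumptions for illustration, not LinkedIn's documented behavior):

    import requests

    resp = requests.get("https://www.linkedin.com/in/some-profile")
    # A denied request comes back as an error status instead of the
    # page content, e.g. 403 Forbidden.
    if resp.status_code == 403:
        print("request denied")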


But what is the difference between a scrape and a page view? If a human looks at it once, after scraping, does it become a page view? Is Pocket, downloading content on my behalf for me to read later, a scraper? What's the difference between a scraper and an offline browser whose content a human never browses?


True, but in making the request, you provide information about who is making that request. If you say, "I am a bot!", and they grant you permission, your request is legal.

But if you say, 'I am NOT a bot', e.g. by spoofing a browser's user agent string while you are in fact a bot, then you are requesting access under false pretenses in order to circumvent their terms of service. That feels kinda morally wrong, and illegal.
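Concretely, the only difference is what goes in the User-Agent header. A hypothetical sketch (both strings are invented):

    import requests

    # Honest: declare yourself as an automated client.
    honest = {"User-Agent": "MyScraper/1.0 (+https://example.com/about-my-bot)"}

    # Spoofed: claim to be an ordinary desktop browser.
    spoofed = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                             "AppleWebKit/537.36 (KHTML, like Gecko) "
                             "Chrome/120.0.0.0 Safari/537.36"}

    requests.get("https://example.com/page", headers=honest)   # declared bot
    requests.get("https://example.com/page", headers=spoofed)  # pretending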


That argument works, insofar as it does, only for more recognizable bots and browsers. If I write a client of some sort that identifies itself as:

Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Snackmaster Pro/666.0.666

What do you do?

I also tell my browser to lie about what it is sometimes, because of sites that are malfunctioning but whose owners, instead of fixing the errors, paper over them with "Use Chrome" (or IE, or whatever) checks.

Is that 'kinda' illegal or morally wrong (two very different things)?

If so, that seems like a belief that all sorts of browser defaults are 'kinda' wrong and/or illegal to change. JavaScript? Lying about installed fonts/screen dimensions/whatever? Refusing to keep non-session cookies between sessions? That slope would seem to get pretty slippery...


In your first case, if you are running on Windows NT 6.1 using WebKit on a new browser for humans called 'Snackmaster Pro', then you aren't doing anything wrong.

If by client you mean a robot, then you are pretending to be a browser and you are accessing the service without permission.

Let me ask you a question: say your client was hitting my service with that user agent, 100 times a second, crawling through URLs sequentially. Let's say I added it to my robots.txt deny list and started blocking that user agent. Would you change the user agent and continue?
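For reference, that deny list is just a robots.txt rule, and a well-behaved client can check it before fetching. A minimal sketch using Python's standard library (site and path are hypothetical):

    from urllib.robotparser import RobotFileParser

    robots = RobotFileParser("https://example.com/robots.txt")
    robots.read()

    ua = "Snackmaster Pro/666.0.666"
    if robots.can_fetch(ua, "https://example.com/profiles/1"):
        pass  # allowed: go ahead and fetch
    else:
        pass  # denied: switching user agents here is exactly the
              # circumvention being described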

If someone creates a site that says, 'Access to this site is for 640x480 browsers only, any other use is forbidden', then I think it's pretty clear that it's a stupid site, but also that faking your screen resolution is accessing the site without consent. There is no slope; someone (LinkedIn) putting explicit terms on their website is pretty clear.


Have you ever heard of "headless browsers" (like headless [Chrome](https://github.com/dhamaniasad/HeadlessBrowsers/issues/37))? What are some defining characteristics of browsers that are absent in scraping clients? If I open a browser window while doing the scraping, is that acceptable?
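To make the question concrete, here is a minimal headless-Chrome sketch using Selenium, one common approach (the URL is a placeholder):

    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # real Chrome, just no window

    driver = webdriver.Chrome(options=options)
    driver.get("https://example.com")
    html = driver.page_source  # same engine, same JS, same rendering
    driver.quit()

Drop the headless flag and the identical code drives a visible browser window, which is exactly why the browser/bot line is blurry.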


I very rarely use robots, and think I've only been "abusive" (not really abusive, in my book) once.

What if I send a null UA? Or use it as an opportunity to share my favorite quote?

What if my software doesn't hammer the site like a robot and keeps the request volume reasonable (use whatever threshold you think is reasonable here), but also doesn't do what you might expect a human clicking around to do?


There isn't a universal 'I'm a bot' setting. There are user agent conventions, but they are hardly standardized. Your point works in theory, but it's not something a site operator can just implement and be reasonably confident they won't be scraped.
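For example, the closest thing to a convention is the self-identifying user agent strings the major crawlers send (real examples, but nothing requires this format):

    Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
    Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)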


No, it won't prevent being scraped. That's not the point I was making.

The point is that the scraper would have to hide their identity and intentions, which removes any claim that they are being 'honest' and not trying to circumvent the service provider's efforts to prevent scraping.


The User-Agent header is not an authorization header. HTTP explicitly has an Authorization header for that.
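That is, the two headers carry different semantics. A minimal sketch (the URL and token are placeholders):

    import requests

    resp = requests.get(
        "https://example.com/api/resource",
        headers={
            "User-Agent": "MyClient/1.0",       # says who is asking
            "Authorization": "Bearer <token>",  # says what you may access
        },
    )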


I understand the appeal of this argument, but there are clearly cases where a computationally valid request and response is illegal despite the fact that the server "chose" to satisfy the request. An obvious example would be any exploit where an attacker can construct a particular request and get access to someone else's private data.
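A classic hypothetical is an insecure direct object reference: the server happily satisfies the request, yet the access is still unauthorized (URL and IDs invented for illustration):

    import requests

    # Your own invoice:
    requests.get("https://example.com/api/invoices/1001")

    # Someone else's, reachable just by changing the ID; the server may
    # answer 200 OK, but the access remains illegitimate.
    requests.get("https://example.com/api/invoices/1002")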



