Does this prevent Google from returning captchas if you use a robot to scrape th...

kabacha · on Sept 10, 2019

I mean it should. That is also a huge anti-competitive action that just isn't pursued by anyone yet: google makes money of scraping and denies scrapers scraping them - that's all sorts of messed up.

The problem is that someone would have to sue google first and no one will do that unless there's big business incentive and big business can already scrape the shit out of google.

This is the weird thing about web-scraping. Big companies can get around protections quite easily - it's the small scripts and average users that get hurt by them. No one is going to tell you this because people would stop buying Cloudflare's "anti-bot 99% effective anti-ddos money saving package" which is complete bullshit.

skybrian · on Sept 10, 2019

Google respects robots.txt, so it might be hard to prove that they are accessing websites without their implied consent.

Also, their own robots.txt contains "Disallow: /search". So, there is arguably no inconsistency, either.

But, what does this new ruling mean for robots.txt?

SnorkelTan · on Sept 10, 2019

I think OP is getting at the nature of the relationship is kinda imbalanced. Consider basically most of their website is off limits: https://www.google.com/robots.txt

nojvek · on Sept 10, 2019

you can have your website offlimits from google by having a robots.txt too. What's the problem? People willingly want google to index them so they appear in search results and google can send them traffic.

dangero · on Sept 10, 2019

Google only scrapes sites that allow it by their robots.txt file so I don’t think their policy is as hypocritical as you are making it sound.

taptaptapimin · on Sept 10, 2019

This is true, but only technically. Google won't actively scrape anything disallowed in robots.txt, but those resources can still be indexed if found in the many other ways Google aggregates data, all of which is automated.

Robots.txt isn't something that bars access to information. It's just a notice that the administrator does not want large amounts of queries against certain resources.

BoorishBears · on Sept 10, 2019

>Robots.txt isn't something that bars access to information. It's just a notice that the administrator does not want large amounts of queries against certain resources.

Many times Robots.txt are implemented with the interest of barring access to information.

This works by relying on scrapers respecting the file, but it's no different than a no-loitering sign which itself cannot actively stop someone who is loitering.

Google doesn't have a Robots.txt disallowing search because it can't handle a large amount of queries against a resource...

jjohansson · on Sept 10, 2019

They still scrape and index sites blocked by robots.txt, but they often don’t display those sites in their SERPs (but sometimes they still do)

C14L · on Sept 10, 2019

Never seen it in any logs in 20 years.

Do you have a source for that claim?

jjohansson · on Sept 11, 2019

Looks like I’m wrong. They will rank pages blocked by robots.txt, but won’t index them.

https://www.google.ca/amp/s/www.searchenginejournal.com/goog...

kabacha · on Sept 10, 2019

Google has an effective monopoly on search engine market - you _can't realistically_ block google from scraping your website. They have this power and they are abusing it.

Also robots.txt is bullshit, if a person can access a public website why automated script shouldn't - technically speaking it's the same thing.

salawat · on Sept 12, 2019

It really isn't.

Imagine you have a piece of information that your neighbors in town come to you for every once in a while. They come over every now and again, ask you for it, maybe even bring you cookies for the trouble, and you provide it.

Then there's Ted.

Ted is insatiable. He hounds you at every minute of your day, constantly asking you the same question over and over. You've done everything you can. You tried to reason with Ted. You tried to contact whoever it is that brought Ted to your neighborhood. You even got so desperate, you moved a few houses down to escape the incessant hounding. That only worked for a little while though, and Ted found you again.

So you tried to stop answering the door; no use, he pokes his head in every time you're in the garage. You demanded people identify themselves first. Oh well, it's changed little. Now he just names himself gibberish names before hounding you for the same things over and over again.

This would not by any stretch of the imagination be acceptable behavior between two people. The main factor in determination for a court injunction would likely be physical trespass, or public nuisance; but no digital equivalent exists currently other than the CFAA, in the sense that in as much as one can prove that the access to the system is inconsistent with the intent or legal terms of providing a service, one may seek relief.

The problem is, LinkedIn has failed to make a convincing argument in the eyes of the appellate court that hiQ's Ted is violating the CFAA while LinkedIn has proactively engaged in activity disrupting hiQ's ability to do business; business which was consistent with the service granted to unknown members of the public at large.

In the Court's eyes, from the sounds of it, it appears LinkedIn is doing the greater harm.

What it looks like to me is this is setting up a common law framework that is going to cause website/service providers to have to choose from the get-go what their relationship to the Net is.

Are you just a transitor providing a service over the Net to a limited, select clientele bound by very specific terms of service? Then you may have a leg to stand on in fending off malignant Teds, but your exposure and onboarding will need to have concomitant friction to make the case to the Court that these Teds were never meant to be serviced in the first place.

Or, are you providing a public content portal, meant to make things accessible to everyone, with minimal terms? In which case, no legal Ted relief for you!

Just because it is your "syatem' and it isn't connected to your nervous system, does not mean it isn't capable of being caused harm to, or inflicting harm on someone else with careless muckery.

The one thing that disturbs me most is how the Court has disregarded the chilling effect that interpreting a duty to maintaining visibility may incur. A First Amendment challenge may end up being the inevitable result of this legal proceeding.

3xblah · on Sept 10, 2019

All true.