Ask HN: Can I get in trouble for crawling using the Googlebot user agent?
44 points by goferito 4 days ago | 48 comments
A lot of sites have IP crawl restrictions, but add exceptions for Googlebot. Could Google or the crawled site legally do something when they find out?





I feel like it's a good thing to maintain a certain level of professional ethics, and, while it depends on the specifics of the situation, I'd suggest that falsely claiming to third parties to be something you aren't, in order to do something they don't want you to do, generally falls short of that ethical bar.

Say your bot misbehaves and effectively starts DoSing a site with a whole lot of pages, like a small Reddit clone or something. And say Reddit doesn't have another way to distinguish between your bot and the Googlebot. You have now put Reddit in a position where they have to either block the Googlebot (and possibly lose a huge pile of money in the process) or buy a lot more hardware and bandwidth to absorb your crawler as well. That's not cool, to put it bluntly.


Not to detract from your point, but blindly blocking a user agent because of one bad actor, and losing money in the process, is not a good solution.

A more robust solution can be coded using information from Google: https://support.google.com/webmasters/answer/80553?hl=en
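For what it's worth, the check that support page describes (reverse DNS lookup, domain check, then forward confirmation) can be sketched in a few lines of Python. This is only a sketch under my own naming; a real deployment would want caching, timeouts, and error logging:

```python
import socket

# Per Google's support page, a genuine Googlebot's IP must reverse-resolve
# to a hostname under one of these domains.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def is_real_googlebot(ip):
    """Forward-confirmed reverse DNS check for a claimed Googlebot IP."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)    # 1. reverse (PTR) lookup
    except OSError:
        return False                                 # no reverse record
    if not hostname.endswith(GOOGLE_SUFFIXES):       # 2. domain check
        return False
    try:
        _, _, addresses = socket.gethostbyname_ex(hostname)  # 3. forward lookup
    except OSError:
        return False
    return ip in addresses                           # must map back to the IP
```

The forward-confirmation step is what makes this robust: anyone can publish a misleading PTR record for their own IP, but they can't make Google's DNS resolve that hostname back to it. It also explains a point made elsewhere in the thread: a Google Cloud instance reverse-resolves under googleusercontent.com, so it fails the domain check.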


I would hope the people at Reddit are smart enough to check an IP and not just a user-agent.

I'm not a lawyer and this isn't legal advice; but my instinct is you won't get in trouble.

Most important argument: the Chrome user-agent contains the word 'Mozilla'. Obviously (we'd argue) Google doesn't intend these tokens to be accurate; they're some kind of compatibility marker.

Are you committing trademark violation? Given the nature of trademarks, it's not clear that you are.

Are you misrepresenting yourself to the site in a way that violates the CFAA? This is probably your biggest area of risk. But you could argue the site is giving away information to Google, a company whose slogan until recently was 'free the world's information'; therefore they weren't taking reasonable steps to secure the information you've scraped.


What is Google's slogan these days?

"Tremble in fear before us."

"Cower, brief mortals"

Look on my Works, ye Mighty, and despair!

'stayin alive, stayin alive'? They've done surprisingly well transitioning to mobile & promoted their android guy to CEO, but haven't been able to diversify revenue away from ads. And people are starting to hate ads.

They fired their android guy. They gave android to their chrome guy and made him the CEO.

"When the time comes, add an adblocker to your browser. This will put an end to the whole affair."

Speaking of law (and these days this is the theory of law, not the practice), intent is supposed to count. That said, people at the level of the court system are woefully ignorant of technology, and IT more specifically. Intent seems to hold up only for those rich enough to push very hard for it, and prosecuting attorneys seem to get erections at the thought of putting someone in jail for even the perception of a cyber crime these days, so... 50/50 maybe =P

It depends on what and how you're trying to crawl. It's trivial to verify a "true" Googlebot using reverse DNS:

https://support.google.com/webmasters/answer/80553?hl=en

I know of a few sites that use this as the first step (of many!) to add bots to their "naughty" list.


Sure, go join the realm of shady SEOs and malware. If I really want to stop you, I'll know you're not coming from a Google IP range. https://www.incapsula.com/blog/was-that-really-a-google-bot-...

However, consider what your ultimate end game is. If it's a website you expect visitors to find through Google or the Play Store, good luck once webmasters start reporting your misbehaving "Googlebot" crawler.


Unless you do it from a Google Cloud instance, that is.

>Unless you do it from a Google Cloud instance, that is.

What's the reverse DNS for Google Cloud IPs? Google says to check that Googlebot's IP resolves to either a .google.com or .googlebot.com domain.

https://support.google.com/webmasters/answer/80553?hl=en


Couldn't you use GWT Mobilizer to scrape a site then index that?

Like this: http://i.imgur.com/ocR54Yq.jpg


Good point -- although you can see why it isn't frequently implemented: DNS lookups aren't cheap for this kind of thing.

It would be sufficient to let some requests come through from "Googlebot", and then deal with them (block, rate-limit, whatever) once the DNS check has been completed.
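A sketch of that idea in Python (the function names and in-process dict are my own invention; a real server would hand the check to a worker queue and share verdicts across processes, e.g. in Redis): admit the first request from an unknown "Googlebot" immediately, verify in the background, and enforce the verdict from then on.

```python
import socket
import threading

_verdicts = {}   # ip -> bool: is this a genuine Googlebot?
_pending = set()  # ips with a verification still in flight
_lock = threading.Lock()

def _verify(ip):
    """Forward-confirmed reverse DNS, per Google's published guidance."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)
        _, _, addrs = socket.gethostbyname_ex(hostname)
        ok = hostname.endswith((".googlebot.com", ".google.com")) and ip in addrs
    except OSError:
        ok = False
    with _lock:
        _verdicts[ip] = ok
        _pending.discard(ip)

def should_block(ip):
    """Let an unverified 'Googlebot' through while its first DNS check
    runs in the background; enforce the verdict on later requests."""
    with _lock:
        if ip in _verdicts:
            return not _verdicts[ip]
        if ip not in _pending:
            _pending.add(ip)
            threading.Thread(target=_verify, args=(ip,), daemon=True).start()
    return False  # optimistic: don't stall the request on DNS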

.googleusercontent.com

You know, you have to be able to crawl your own or client sites as Google (using Screaming Frog or DeepCrawl) to identify any crawl / crawl-budget issues.

I cannot comment on the legal aspects, but the Chrome user-agent contains "like Gecko", "AppleWebKit" and "Safari". It is common for user-agents to be constructed like this for compatibility (mostly for historical reasons).

Depends on the jurisdiction. In the US, the answer is "you really don't want to find out".

In my home country, it's actually quite interesting: fraud usually requires (a) a lie (conveying wrong information with intent), and (b) a financial cost to the other party, and (c) a financial gain for you.

It's debatable at that level, already, because their loss is rather hard to quantify, and probably small. Plus, I believe your financial gain must be directly related to their cost.

And, finally, you actually have to lie to a human being; lying to a machine doesn't qualify. There was a guy who earned a five-digit Euro amount by producing fake bottles and feeding them into deposit machines -- no crime!


What happens if you put "(not Googlebot)" on the end of your user agent?

Maybe "Googlebot" is a trademark, or maybe you're violating the usage terms the crawled sites have put in place by masquerading as it... Could you get into trouble? _MAYBE_? Seems like a stretch in practice, though. I've come across people doing this to sites I've been an admin of relatively often, and unless you're crawling with enough intensity to cause a DoS, or doing something nefarious with the content, most site owners would just roll their eyes and move on.

A quick search of the US trademarks does not show "Googlebot" as a trademark.

If you crawl a site, index it, and then use that for commercial purposes -- all while using Google's trademark to crawl -- yes, you'll probably get a letter from Google.

As for the site owner, it's on them to decide what to do with your traffic. HTTP is an open, extensible protocol; you can send almost anything in a request that the protocol allows. By opening their service to HTTP, the site owner takes on the job of deciding what to do with whatever arrives.


How would you get a letter from Google if you never scrape Google's sites? They'd never know.

I was trying to say that if, for example, you were creating a search engine to compete with Google, but using Google's name to build that service, you'd be in trouble.

How would Google know? They could start by setting up fictitious websites seemingly unaffiliated with them. If your crawler hit one of those sites, you would reveal yourself. I wouldn't be at all surprised if Google had this kind of honeypot sitting out there watching for web crawlers (rogue or otherwise).

Google likely also has business partner relationships with big content producers, which I'm sure they are able to get reports back from regarding their crawling -- to ensure that Google is correctly finding all the content which the site owners want them to.

As an aside, I used to run such a honeypot website. Web crawler behavior is fascinating. I loved being able to find, detect and classify various forms of web crawlers. Some which followed robots.txt, some that didn't, some that went directly to robots.txt and then scraped the pages which were meant to be excluded. I wish I had kept the project going and formalized the results.


Even if they do know, what can they do about it? Does Google have an exclusive legal claim to their user-agent string?

If the name "googlebot" is trademarked, yes they would have a basis for a claim. It would at least be leverage they could use if they believed you were causing them harm in some way.

Are you only crawling or also scraping the website?

If the sites in question only add an exception for Googlebot and not for other crawlers (e.g. Yahoo, Bing, etc.), I would say that crawling them this way is against the site owner's consent.

However if the site owner adds this exception also for other crawlers, you could argue that the site owner's intent of only allowing certain crawlers has not been made explicit. In that case you'd have a chance against the claims from the site's owner.

On the other hand Google could possibly sue you for using the user-agent "Googlebot".

The important question here is: would they? If you stay under the radar, no one (not even the courts) would bother.

PS: I am only a law student, and I am not familiar with any laws/regulations/precedents governing this specific issue. I think from the site owner's perspective it's a grey area; from Google's perspective, brands and IP are established concepts in law. This is a student's very personal opinion at first sight, so take it with a grain of salt :).


> On the other hand Google could possibly sue you for using the user-agent "Googlebot".

Genuinely curious: on what basis? Can you trademark (or similar) a user-agent?


Google is a registered trademark; "Googlebot" may not be. However, as it is clearly affiliated with Google, they might pursue claims even if they never registered it.

I am not familiar with U.S. law. If U.S. courts take a strictly formalistic approach to the use of names in a business context, the result may be different.

IMHO, though, Googlebot is clearly associated with Google, and anyone who uses Googlebot as their user agent is tricking site owners into believing the request came from Google.

As for legal grounds, Google could sue on unfair competition.


May violate a site's TOS... but I don't think you'd ever get in any real trouble... the most you'd get is a cease-and-desist letter and some wasted time with lawyers. I think it's on them to block you at the IP level if you're violating the TOS or causing them grief. And look... if a crawl causes them grief, they really need to invest more in DevOps. (Please do what you can to encourage more companies to invest in DevOps!)

"I left my door unlocked and told my friends they could use my living room, but then they put their feet up on my coffee table... Not cool, man!" Pretty much the equivalent situation.


One thing they could do is tell you to stop. If they have told you to stop and taken measures to block you (e.g. blocking crawlers besides Google's), then persisting is illegal. I believe somebody was successfully sued by Facebook for continuing to scrape after Facebook told them to stop. I'm not sure about the legality if you haven't been explicitly asked to stop, but as long as you're never blatant enough for them to notice, you shouldn't have any trouble.

You won't get in trouble, but if the site uses products from the likes of Akamai (Bot Manager) or Shape Security then you'll probably be blocked.

Since when has it ever been illegal to claim you're someone or something that you're not on the Internet? This is legal without question.

It might be against the terms of service of the website you're crawling, which puts you in violation of the Computer Fraud and Abuse Act (i.e. you're considered to be "hacking" them).

Maybe that's called "hacking" in the relevant document, but I'd classify that as overly broad.

Well, however you classify it doesn't matter; what matters is how the government classifies it. There's the tragic story of Aaron Swartz, who was caught up in this nonsense: https://en.wikipedia.org/wiki/Aaron_Swartz

1) It's illegal to impersonate law enforcement online.

2) It's illegal to commit fraud by pretending to be someone else online (when, for example, using someone else's credit card or opening a bank account in their name).

3) You can be sued even for things that are totally legal, such as blemishing Google's brand by impersonating them and doing things that annoy people.


I wonder if most major sites that whitelist Googlebot also have exceptions for Slurp, Bingbot, and other major search engine bots. If not, then it would be interesting to know how these other companies deal with it, or if they just politely back off.

I handle a number of sites that require a login but need Google to index their content. I verify that Googlebot requests are actually coming from a domain owned by Google. I can't imagine that I'm the only one doing that.

It seems to be Google's preferred method of verification: https://support.google.com/webmasters/answer/80553?hl=en

Not saying you should live your life in irrational fear but, even if they just think they can do something about it, that would be enough to mess up your life significantly while having no effect on them at all.

To properly allow the Googlebot to crawl your site, you usually combine the user-agent check with a DNS lookup on the IP. This is also what Google recommends.

If I were going to do this the last place I'd ask this question is HackerNews. But that's just me.


