
Ask HN: Can I get in trouble for crawling using the Googlebot user agent? - goferito
A lot of sites have IP crawl restrictions, but add exceptions for Googlebot. Could Google or the crawled site legally do something when they find out?
======
CobrastanJorji
I feel like it's a good thing to maintain a certain level of professional
ethics, and, while it depends on the specifics of the situation, I'd suggest
that falsely claiming to third parties to be something you aren't in order to do
something they don't want you to do generally falls short of that ethical bar.

Say your bot misbehaves and effectively starts DOSing a site with a whole lot
of pages, like a small Reddit clone or something. And say Reddit doesn't have
another way to distinguish between your bot and the Googlebot. You have now put
Reddit in a position where they have to either block the Googlebot (and
possibly lose a huge pile of money in the process) or else buy up a lot more
hardware and bandwidth to pay for your crawler as well. That's not cool, to
put it bluntly.

~~~
mathrawka
Not to detract from your point, but blindly blocking a User Agent because of a
bad actor and losing money is not a good solution.

A more robust solution can be coded using information from Google:
[https://support.google.com/webmasters/answer/80553?hl=en](https://support.google.com/webmasters/answer/80553?hl=en)

------
awinter-py
I'm not a lawyer and this isn't legal advice; but my instinct is you won't get
in trouble.

Most important argument: the chrome user-agent contains the word 'mozilla'.
Obviously (we argue) Google doesn't intend these to be accurate; instead they
are some kind of compatibility mark.

Are you committing a trademark violation? Given the nature of trademarks, it's
not clear that you are.

Are you misrepresenting yourself to the site in a way that violates the CFAA?
This is probably your biggest area of risk. But you can argue the site is
giving away information to google, a company whose slogan until recently was
'free the world's information'. Therefore they weren't taking plausible steps
to secure the information you've scraped.

~~~
mandude
What is Google's slogan these days?

~~~
awinter-py
'stayin alive, stayin alive'? They've done surprisingly well transitioning to
mobile & promoted their android guy to CEO, but haven't been able to diversify
revenue away from ads. And people are starting to _hate_ ads.

~~~
pacala
They fired their android guy. They gave android to their chrome guy and made
him the CEO.

------
mootothemax
It depends on what and how you're trying to crawl. It's trivial to verify a
"true" Googlebot using reverse DNS:

[https://support.google.com/webmasters/answer/80553?hl=en](https://support.google.com/webmasters/answer/80553?hl=en)

I know of a few sites that use this as the first step (of many!) to add bots
to their "naughty" list.
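
For anyone curious what that check looks like, here is a rough Python sketch, not a definitive implementation. The domain suffixes are the two mentioned elsewhere in this thread (.googlebot.com and .google.com). The forward-confirmation step and the function names `hostname_is_google` and `is_real_googlebot` are my own assumptions for illustration. `socket.gethostbyname_ex` only handles IPv4, so IPv6 would need `socket.getaddrinfo` instead.

```python
import socket

# Suffixes mentioned in this thread; treat the list as an assumption.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def hostname_is_google(hostname):
    """True if the hostname ends with one of the expected Google suffixes."""
    return hostname.lower().rstrip(".").endswith(GOOGLE_SUFFIXES)


def is_real_googlebot(ip, reverse=socket.gethostbyaddr,
                      forward=socket.gethostbyname_ex):
    """Reverse-resolve the IP, check the domain, then forward-confirm it."""
    try:
        hostname = reverse(ip)[0]
        if not hostname_is_google(hostname):
            return False
        # The forward lookup must map back to the same IP; otherwise anyone
        # who controls their own reverse DNS zone could claim a Google name.
        return ip in forward(hostname)[2]
    except OSError:
        # Covers socket.herror / socket.gaierror: no record means not verified.
        return False
```

The resolver functions are passed in as parameters only so the logic can be exercised without touching the network; in normal use you would call `is_real_googlebot(ip)` with the defaults.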

------
cube00
Sure, go join the realms of shady SEOs and malware. If I really want to stop
you, I'll know you're not coming from a Google IP range.
[https://www.incapsula.com/blog/was-that-really-a-google-bot-...](https://www.incapsula.com/blog/was-that-really-a-google-bot-crawling-my-site.html)
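
A minimal sketch of an IP-range check in Python, assuming you already have a list of the crawler's networks from somewhere you trust. The networks below are placeholders from the reserved documentation ranges, not real Google ranges, and `ASSUMED_CRAWLER_NETWORKS` and `ip_in_networks` are hypothetical names.

```python
import ipaddress

# Placeholder networks (RFC 5737 documentation ranges), NOT real Google ranges.
ASSUMED_CRAWLER_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def ip_in_networks(ip, networks=ASSUMED_CRAWLER_NETWORKS):
    """True if the address falls inside any of the given networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)
```

A request that claims to be Googlebot but fails `ip_in_networks(remote_ip)` is the one you would block or rate-limit.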

However, consider what your ultimate end game is. If it's a website you expect
visitors to find through Google or the Play Store, good luck once webmasters
start reporting your misbehaving "Googlebot" crawler.

~~~
lend000
Unless you do it from a Google Cloud instance, that is.

~~~
mootothemax
>Unless you do it from a Google Cloud instance, that is.

What's the reverse DNS for Google Cloud IPs? Google says to check that
Googlebot's IP resolves to either a .google.com or .googlebot.com domain.

[https://support.google.com/webmasters/answer/80553?hl=en](https://support.google.com/webmasters/answer/80553?hl=en)

~~~
lend000
Good point -- although it makes sense why it isn't frequently implemented. DNS
lookups aren't cheap for this kind of thing.

~~~
Symbiote
It would be sufficient to let some requests come through from "Googlebot", and
then deal with them (block, rate-limit, whatever) once the DNS check has been
completed.
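
Roughly, that could look like the following Python sketch. Everything here is hypothetical naming (`DeferredBotCheck`, `allow`), and `verify` stands in for whatever slow check you use, such as a DNS lookup. It assumes a single-threaded caller and never expires cached verdicts, so treat it as an outline rather than production code.

```python
from concurrent.futures import ThreadPoolExecutor


class DeferredBotCheck:
    """Serve claimed-Googlebot requests until a background check finishes."""

    def __init__(self, verify, max_workers=4):
        self._verify = verify  # callable: ip -> bool (e.g. a DNS check)
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._pending = {}     # ip -> Future holding the verdict

    def allow(self, ip):
        future = self._pending.get(ip)
        if future is None:
            # First sighting: start the slow check, serve this request anyway.
            self._pending[ip] = self._pool.submit(self._verify, ip)
            return True
        if not future.done():
            return True  # verdict not in yet; keep serving
        # Verdict is in: a failed or crashed check means the IP is blocked.
        return future.exception() is None and bool(future.result())
```

The request path never waits on DNS; the cost is that an impostor gets a few requests in before the verdict lands.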

------
beejiu
I cannot comment on the legal aspects, but the Chrome user-agent contains
"like Gecko", "AppleWebKit" and "Safari". It is common for user-agents to be
constructed like this for compatibility. (Mostly for historical reasons.)
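
To illustrate, here is a small Python sketch that checks which of those tokens appear in a Chrome-style user-agent string. The sample string follows the usual desktop Chrome format, but the platform and version number are only illustrative, and `tokens_present` is a hypothetical helper.

```python
# Sample Chrome-style header; platform and version are illustrative only.
CHROME_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)

TOKENS = ("Mozilla", "AppleWebKit", "like Gecko", "Safari", "Chrome")


def tokens_present(user_agent, tokens=TOKENS):
    """Return the tokens that appear as substrings of the header."""
    return [t for t in tokens if t in user_agent]
```

Calling `tokens_present(CHROME_UA)` returns all five tokens, which is why naive substring sniffing on the header says little about who actually sent the request.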

------
matt4077
Depends on the jurisdiction. In the US, the answer is "you really don't want
to find out".

In my home country, it's actually quite interesting: fraud usually requires
(a) a lie (conveying wrong information with intent), and (b) a financial cost
to the other party, and (c) a financial gain for you.

It's debatable at that level, already, because their loss is rather hard to
quantify, and probably small. Plus, I believe your financial gain must be
directly related to their cost.

And, finally, you actually have to lie to a human being. Lying to a machine
doesn't qualify. There was a guy who earned a five-digit euro amount by
producing fake bottles and feeding them into deposit machines–no crime!

------
d2p
What happens if you put "(not Googlebot)" on the end of your user agent?

------
riceo100
Maybe "Googlebot" is a trademark, or maybe you are violating the usage terms
the crawled sites have put in place by masquerading yourself... Could you get
into trouble? _MAYBE_? Seems like a stretch in practice though. I've come
across people doing this to sites I've been an admin of relatively often, and
unless you're crawling with enough intensity to cause a DoS or doing something
nefarious with the content, most site owners would maybe roll their eyes and
move on.

~~~
jws
A quick search of the US trademarks does not show "Googlebot" as a trademark.

------
taftster
If you crawl a site, index it, and then use that for commercial purposes --
all while using Google's trademark to crawl -- yes, you'll probably get a
letter from Google.

As for the site owner, it's on them to decide what to do with your traffic.
HTTP is an open protocol and extensible. You could send almost anything in
your request, as allowed by the protocol. The site owner has opened their
service to the HTTP protocol and it's on them to decide what to do with your
traffic.

~~~
zackify
How would you get a letter from Google if you are never scraping google's
sites? They would never know?

~~~
return0
Even if they do know, what do they have to do with it? Does google have a
legal claim to their user-agent string exclusively?

~~~
taftster
If the name "googlebot" is trademarked, yes they would have a basis for a
claim. It would at least be leverage they could use if they believed you were
causing them harm in some way.

------
terminalcommand
Are you only crawling or also scraping the website?

If the sites in question only add an exception for googlebot and not other
crawlers (e.g. Yahoo, bing, etc.) I would say that it is against the site
owner's consent.

However if the site owner adds this exception also for other crawlers, you
could argue that the site owner's intent of _only_ allowing certain crawlers
has not been made explicit. In that case you'd have a chance against the
claims from the site's owner.

On the other hand Google could possibly sue you for using the user-agent
"Googlebot".

The important question here is: would they? If you stay under the radar no one
-even the courts- would bother.

PS: I am only a law student, I am not familiar with any
laws/regulations/precedents governing this specific issue. I think from the
site owner's perspective it's a grey area. From google's perspective brands
and ip are established concepts in law. This is a student's very personal
opinion at first sight, take it with a grain of salt :).

~~~
jrvidal
> On the other hand Google could possibly sue you for using the user-agent
> "Googlebot".

Genuinely curious: on what basis? Can you trademark (or similar) a user-agent?

~~~
terminalcommand
Google is a registered trademark, Googlebot may not be registered. However as
it clearly is affiliated with Google, even if Google had not registered it
they might pursue claims.

I am not familiar with U.S. law. If the courts in U.S. adopt a strictly
formalistic approach on using names in a business context, then the results
may be different.

However IMHO, Googlebot is clearly associated with Google and anyone who uses
Googlebot as their User Agent is tricking the site's owner into believing
that the request was made from Google.

As for legal grounds, Google could sue on unfair competition.

------
dbg31415
May violate a site's TOS... but I don't think you'd ever get in any real
trouble... most you'd get is a cease and desist letter... have to waste some
time with lawyers... But I think it's on them to block you at the IP level if
you are violating the TOS / causing them grief. And look... if a crawl causes
them grief then they really need to invest more in DevOps. (Please do what you
can to encourage more companies to invest in DevOps!)

"I left my door unlocked and told my friends they could use my living room,
but then they put their feet up on my coffee table... Not cool, man!" Pretty
much the equivalent situation.

------
syrrim
One thing they could do is tell you to stop. If they have told you to stop,
and taken measures to block you out (blocking crawlers besides google) then
persisting is illegal. I believe somebody got successfully sued by Facebook for
continuing to scrape after Facebook told them to stop. I'm not sure about the
legality if you haven't been explicitly asked to stop, but as long as you are
never blatant enough for them to notice, you shouldn't have any trouble.

------
Edmond
You won't get in trouble but if the site uses products from the likes of Akamai
(Bot Manager) or Shape Security then you'll probably be blocked.

------
mightytightywty
Since when has it ever been illegal to claim you're someone or something that
you're not on the Internet? This is legal without question.

~~~
jdavis703
It might be against the terms of service of the website you're crawling, which
puts you in violation of the Computer Fraud and Abuse Act (i.e. you're
considered to be "hacking" them).

~~~
tomsmeding
Maybe that's called "hacking" in the relevant document, but I'd classify that
as overly broad.

~~~
jdavis703
Well, how you classify it doesn't matter; what matters is how the
government classifies it. There's the tragic story of Aaron Swartz, who was
caught up in this nonsense:
[https://en.wikipedia.org/wiki/Aaron_Swartz](https://en.wikipedia.org/wiki/Aaron_Swartz).

------
alxmdev
I wonder if most major sites that whitelist Googlebot also have exceptions for
Slurp, Bingbot, and other major search engine bots. If not, then it would be
interesting to know how these other companies deal with it, or if they just
politely back off.

------
fbomb
I handle a number of sites that require a login but need Google to index their
content. I verify that Googlebot requests are actually coming from a domain
owned by Google. I can't imagine that I'm the only one doing that.

~~~
jimktrains2
It seems to be Google's preferred way of verification:
[https://support.google.com/webmasters/answer/80553?hl=en](https://support.google.com/webmasters/answer/80553?hl=en)

------
jasonkostempski
Not saying you should live your life in irrational fear but, even if they just
think they can do something about it, that would be enough to mess up your
life significantly while having no effect on them at all.

------
mdekkers
To properly allow the Googlebot to crawl your site, you usually combine
checking googlebot with an IP whois lookup. This is also what Google
recommends.

------
oliv__
If I were going to do this the last place I'd ask this question is HackerNews.
But that's just me.

