I feel like it's a good thing to maintain a certain level of professional ethics, and, while it depends on the specifics of the situation, I'd suggest that falsely claiming to third parties to be something you aren't, in order to do something they don't want you to do, generally falls short of that ethical bar.
Say your bot misbehaves and effectively starts DoSing a site with a whole lot of pages, like a small Reddit clone or something. And say the site has no other way to distinguish your bot from the real Googlebot. You have now put them in a position where they have to either block the Googlebot (and possibly lose a huge pile of money in the process) or else buy a lot more hardware and bandwidth to cover your crawler as well. That's not cool, to put it bluntly.
I'm not a lawyer and this isn't legal advice, but my instinct is that you won't get in trouble.
Most important argument: the Chrome user-agent contains the word "Mozilla". Obviously (we argue) Google doesn't intend these tokens to be accurate; they're some kind of compatibility marker.
Are you committing trademark violation? Given the nature of trademarks, it's not clear that you are.
Are you misrepresenting yourself to the site in a way that violates the CFAA? This is probably your biggest area of risk. But you can argue the site is giving away information to Google, a company whose stated mission is to "organize the world's information and make it universally accessible and useful", and that the site therefore wasn't taking reasonable steps to secure the information you've scraped.
"Stayin' alive, stayin' alive"? They've done surprisingly well transitioning to mobile and promoted their Android guy to CEO, but they haven't been able to diversify revenue away from ads. And people are starting to hate ads.
Speaking of law, and in modern times this is the theory of law, not the practice: intent is supposed to count. That said, the people at the level of the court system are woefully ignorant of technology, and of IT more specifically. Intent seems to hold up only for those rich enough to push very hard for it, and prosecuting attorneys these days seem to get erections at the thought of putting someone in jail for even the perception of a cyber crime, so... 50/50 maybe =P
However, consider what your ultimate end game is. If it's a website you expect visitors to find through Google or the Play Store, good luck once webmasters start reporting your misbehaving "Googlebot" crawler.
It would be sufficient to let some requests come through from "Googlebot", and then deal with them (block, rate-limit, whatever) once the DNS check has been completed.
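For concreteness, here's a minimal Python sketch of that check, using the double-reverse-DNS verification Google documents (reverse-resolve the IP, check the host is under googlebot.com or google.com, then forward-resolve to confirm). The function name is my own:

```python
import socket

def is_real_googlebot(ip: str) -> bool:
    """Double-reverse-DNS check, per Google's verification docs:
    the IP must reverse-resolve to a googlebot.com / google.com
    host, and that host must resolve back to the same IP.
    A crawler that merely spoofs the user-agent fails this."""
    try:
        host, _, _ = socket.gethostbyaddr(ip)              # reverse lookup
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        _, _, forward_ips = socket.gethostbyname_ex(host)  # forward confirm
        return ip in forward_ips
    except (socket.herror, socket.gaierror):
        return False
```

A server can let the first few "Googlebot" requests through, run this check in the background, and block or rate-limit the client once it fails, exactly as described above.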
Bear in mind that you have to be able to crawl your own or client sites as Google, using tools like Screaming Frog or DeepCrawl, to identify any crawl or crawl-budget issues.
I cannot comment on the legal aspects, but the Chrome user-agent contains "like Gecko", "AppleWebKit", and "Safari". It is common for user-agents to be constructed like this for compatibility (mostly for historical reasons).
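For reference, a representative Chrome user-agent string (exact version numbers vary by build) carries all of those compatibility tokens at once, while Googlebot's documented user-agent looks like the second line:

```
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
```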
Depends on the jurisdiction. In the US, the answer is "you really don't want to find out".
In my home country, it's actually quite interesting: fraud usually requires (a) a lie (conveying wrong information with intent), (b) a financial loss for the other party, and (c) a financial gain for you.
It's debatable at that level already, because their loss is rather hard to quantify, and probably small. Plus, I believe your financial gain must be directly related to their loss.
And, finally, you actually have to lie to a human being. Lying to a machine doesn't qualify. There was a guy who earned a five-digit Euro amount by producing fake bottles and feeding them into deposit machines; no crime!
Maybe "Googlebot" is a trademark, or maybe you are violating the usage terms the crawled sites have put in place by masquerading yourself... Could you get in to trouble? _MAYBE_? Seems like a stretch in practice though. I've come across people doing this to sites i've been an admin of relatively often, and unless you're crawling with enough intensity to cause a DoS or doing something nefarious with the content, most site owners would maybe roll their eyes and move on.
If you crawl a site, index it, and then use that for commercial purposes -- all while using Google's trademark to crawl -- yes, you'll probably get a letter from Google.
As for the site owner, HTTP is an open, extensible protocol: you can send almost anything in your request, as allowed by the protocol. The site owner has opened their service to HTTP, and it's on them to decide what to do with your traffic.
I was trying to say that if, for example, you were building a search engine to compete with Google, but using Google's name to build that service, you'd be in trouble.
How would Google know? They could start by setting up fictitious websites that are seemingly unaffiliated with them. If your crawler were to hit such a site, you would reveal yourself. I wouldn't be at all surprised if Google had this kind of "honeypot" sitting out there watching for web crawlers (rogue or otherwise).
Google likely also has business-partner relationships with big content producers, from which I'm sure they get reports about their crawling, to ensure that Google is correctly finding all the content the site owners want them to.
As an aside, I used to run such a honeypot website. Web crawler behavior is fascinating. I loved being able to detect and classify the various kinds of web crawlers: some followed robots.txt, some didn't, and some went straight to robots.txt and then scraped exactly the pages that were meant to be excluded. I wish I had kept the project going and formalized the results.
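For flavor, here's a rough Python sketch of that kind of classification. The log format (tuples of IP, user-agent, path) and the /private/ disallow rule are made up for illustration:

```python
from collections import defaultdict

# Hypothetical honeypot setup: robots.txt disallows /private/,
# and access_log yields (ip, user_agent, path) tuples.
DISALLOWED_PREFIX = "/private/"

def classify_crawlers(access_log):
    """Bucket each (ip, user-agent) pair by how it treats robots.txt."""
    seen = defaultdict(lambda: {"read_robots": False, "hit_excluded": False})
    for ip, user_agent, path in access_log:
        flags = seen[(ip, user_agent)]
        if path == "/robots.txt":
            flags["read_robots"] = True
        elif path.startswith(DISALLOWED_PREFIX):
            flags["hit_excluded"] = True

    for key, flags in seen.items():
        if flags["read_robots"] and flags["hit_excluded"]:
            yield key, "read robots.txt, then scraped the excluded pages"
        elif flags["hit_excluded"]:
            yield key, "ignored robots.txt entirely"
        elif flags["read_robots"]:
            yield key, "polite: read robots.txt and complied"
```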
If the name "Googlebot" is trademarked, yes, they would have a basis for a claim. It would at least be leverage they could use if they believed you were causing them harm in some way.
Are you only crawling or also scraping the website?
If the sites in question only add an exception for Googlebot and not for other crawlers (e.g. Yahoo, Bing, etc.), I would say that it is against the site owner's consent.
However, if the site owner adds this exception for other crawlers as well, you could argue that the site owner's intent to allow only certain crawlers has not been made explicit. In that case you'd have a chance against claims from the site's owner.
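To make the distinction concrete, a robots.txt that carves out an exception for Googlebot alone would look something like this (illustrative):

```
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
```

whereas one that also names Bingbot, Slurp, and friends signals a broader "reputable crawlers welcome" intent rather than "Google only".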
On the other hand, Google could possibly sue you for using the user-agent "Googlebot".
The important question here is: would they? If you stay under the radar, no one (not even the courts) would bother.
PS: I am only a law student, and I am not familiar with any laws/regulations/precedents governing this specific issue. I think from the site owner's perspective it's a grey area; from Google's perspective, brands and IP are established concepts in law. This is a student's very personal opinion at first sight, so take it with a grain of salt :).
Google is a registered trademark; Googlebot may not be registered. However, as it is clearly affiliated with Google, they might pursue claims even if they had not registered it.
I am not familiar with U.S. law. If the courts in the U.S. adopt a strictly formalistic approach to using names in a business context, then the result may be different.
However, IMHO, Googlebot is clearly associated with Google, and anyone who uses Googlebot as their user agent is tricking site owners into believing that the request was made by Google.
As for legal grounds, Google could sue you for unfair competition.
May violate a site's TOS... but I don't think you'd ever get in any real trouble... the most you'd get is a cease-and-desist letter and have to waste some time with lawyers... But I think it's on them to block you at the IP level if you are violating the TOS / causing them grief. And look... if a crawl causes them grief, then they really need to invest more in DevOps. (Please do what you can to encourage more companies to invest in DevOps!)
"I left my door unlocked and told my friends they could use my living room, but then they put their feet up on my coffee table... Not cool, man!" Pretty much the equivalent situation.
One thing they could do is tell you to stop. If they have told you to stop and taken measures to block you out (blocking crawlers besides Google), then persisting is illegal. I believe somebody was successfully sued by Facebook for continuing to scrape after Facebook told them to stop. I'm not sure about the legality if you haven't been explicitly asked to stop, but as long as you are never blatant enough for them to notice, you shouldn't have any trouble.
It might be against the terms of service of the website you're crawling, which could put you in violation of the Computer Fraud and Abuse Act (i.e., you're considered to be "hacking" them).
Well, how you classify it doesn't matter; what matters is how the government classifies it. There's the tragic story of Aaron Swartz, who was caught up in this nonsense: https://en.wikipedia.org/wiki/Aaron_Swartz.
1) It's illegal to impersonate law enforcement online.
2) It's illegal to commit fraud by pretending to be someone else online (when, for example, using someone else's credit card or opening a bank account in their name).
3) You can be sued for things that are totally legal, such as blemishing Google's brand by impersonating them and doing things that annoy people.
I wonder if most major sites that whitelist Googlebot also have exceptions for Slurp, Bingbot, and other major search engine bots. If not, then it would be interesting to know how these other companies deal with it, or if they just politely back off.
I handle a number of sites that require a login but need Google to index their content. I verify that Googlebot requests are actually coming from a domain owned by Google. I can't imagine that I'm the only one doing that.
Not saying you should live your life in irrational fear, but even if they just think they can do something about it, that would be enough to mess up your life significantly while having no effect on them at all.
To properly allow the Googlebot to crawl your site, you usually combine checking the user agent with verifying the requesting IP (e.g. via a reverse DNS lookup). This is also what Google recommends.
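Google also publishes the IP ranges Googlebot crawls from, which gives you a second way to verify without DNS round-trips. A sketch, assuming the requests library and the published JSON shape (a "prefixes" list of ipv4Prefix/ipv6Prefix entries):

```python
import ipaddress
import requests

# Google publishes the ranges Googlebot crawls from as JSON; the URL and
# the {"prefixes": [{"ipv4Prefix": ...}, ...]} shape are assumed here.
RANGES_URL = "https://developers.google.com/static/search/apis/ipranges/googlebot.json"

def load_googlebot_networks():
    """Fetch and parse the published Googlebot CIDR ranges."""
    prefixes = requests.get(RANGES_URL, timeout=10).json()["prefixes"]
    return [ipaddress.ip_network(p.get("ipv4Prefix") or p.get("ipv6Prefix"))
            for p in prefixes]

def ip_in_googlebot_ranges(ip: str, networks) -> bool:
    """True if the request IP falls inside any published Googlebot range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)
```

In practice you'd cache the fetched ranges rather than hit the URL per request.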