Say your bot misbehaves and effectively starts DOSing a site with a whole lot of pages, like a small Reddit clone or something. And say Reddit doesn't have another way to distinguish between your bot and the Googlebot. You have now put Reddit in a position where they have to either block the Googlebot (and possibly lose a huge pile of money in the process) or else buy up a lot more hardware and bandwidth to pay for your crawler as well. That's not cool, to put it bluntly.
A more robust solution can be coded using information from Google: https://support.google.com/webmasters/answer/80553?hl=en
Most important argument: the Chrome user-agent contains the word 'Mozilla'. Obviously (we argue) Google doesn't intend user-agent strings to be accurate; they are instead some kind of compatibility marker.
Are you committing trademark violation? Given the nature of trademarks, it's not clear that you are.
Are you misrepresenting yourself to the site in a way that violates the CFAA? This is probably your biggest area of risk. But you can argue the site is giving away information to Google, a company whose slogan until recently was 'free the world's information'. Therefore the site wasn't taking plausible steps to secure the information you've scraped.
I know of a few sites that use this as the first step (of many!) to add bots to their "naughty" list.
However, consider what your ultimate end game is. If it's a website you expect visitors to find through Google or the Play store, good luck once webmasters start reporting your misbehaving "Googlebot" crawler.
What's the reverse DNS for Google Cloud IPs? Google says to check that Googlebot's IP resolves to either a .google.com or .googlebot.com domain.
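For what it's worth, here is a rough Python sketch of that check. The two suffixes come from the comment above. The forward-confirm step (resolve the hostname back and make sure it maps to the same IP) is what Google's verification instructions describe, as far as I recall. The function names and the `reverse_lookup` / `forward_lookup` parameters are my own invention so the logic can be tried without network access, and `socket.gethostbyname_ex` only handles IPv4, so treat this as a starting point rather than a complete verifier.

```python
import socket

# Suffixes taken from the comment above; Google's docs may list others.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def hostname_is_google(hostname):
    """Return True if hostname ends in one of the expected Google domains."""
    return hostname.lower().rstrip(".").endswith(GOOGLE_SUFFIXES)


def is_real_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Reverse-resolve ip, check the domain, then forward-confirm the IP.

    reverse_lookup and forward_lookup are hypothetical injection points
    (not part of any library) so the check can run without real DNS.
    """
    if reverse_lookup is None:
        # gethostbyaddr returns (hostname, aliases, addresses)
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        # gethostbyname_ex returns (hostname, aliases, ipv4_addresses)
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]

    try:
        hostname = reverse_lookup(ip)
    except OSError:
        # No PTR record or lookup failure: cannot be verified.
        return False

    if not hostname_is_google(hostname):
        return False

    try:
        # Anyone can set a PTR record on their own IP range, so the
        # hostname must also resolve back to the same address.
        return ip in forward_lookup(hostname)
    except OSError:
        return False
```

A spoofed user-agent passes none of this: the requester's IP either has no PTR record under those domains, or the forward lookup does not point back at it.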
Like this: http://i.imgur.com/ocR54Yq.jpg
In my home country, it's actually quite interesting: fraud usually requires (a) a lie (conveying wrong information with intent), and (b) a financial cost to the other party, and (c) a financial gain for you.
It's debatable at that level, already, because their loss is rather hard to quantify, and probably small. Plus, I believe your financial gain must be directly related to their cost.
And, finally, you actually have to lie to a human being. Lying to a machine doesn't qualify. There was a guy who earned a five-digit euro amount by producing fake bottles and feeding them into deposit machines: no crime!
As for the site owner: HTTP is an open, extensible protocol, and you can send almost anything in your request that the protocol allows. The site owner has opened their service to HTTP, so it's on them to decide what to do with your traffic.
How would Google know? They would start by setting up fictitious websites that are seemingly unaffiliated with them. If your crawler were to hit such a site, you would reveal yourself. I wouldn't be at all surprised if Google had this kind of "honey pot" sitting out there watching for web crawlers (rogue or otherwise).
Google likely also has business partner relationships with big content producers, which I'm sure they are able to get reports back from regarding their crawling -- to ensure that Google is correctly finding all the content which the site owners want them to.
As an aside, I used to run such a honeypot website. Web crawler behavior is fascinating. I loved being able to find, detect and classify various forms of web crawlers. Some followed robots.txt, some didn't, and some went directly to robots.txt and then scraped the pages that were meant to be excluded. I wish I had kept the project going and formalized the results.
If the sites in question only add an exception for Googlebot and not for other crawlers (e.g. Yahoo, Bing, etc.), I would say that it is against the site owner's consent.
However, if the site owner also adds this exception for other crawlers, you could argue that the site owner's intent of only allowing certain crawlers has not been made explicit. In that case you'd have a chance against claims from the site's owner.
On the other hand, Google could possibly sue you for using the user-agent "Googlebot".
The important question here is: would they? If you stay under the radar, no one, not even the courts, would bother.
PS: I am only a law student, and I am not familiar with any laws/regulations/precedents governing this specific issue. I think from the site owner's perspective it's a grey area. From Google's perspective, brands and IP are established concepts in law. This is a student's very personal opinion at first sight, so take it with a grain of salt :).
Genuinely curious: on what basis? Can you trademark (or similar) a user-agent?
I am not familiar with U.S. law. If the courts in the U.S. adopt a strictly formalistic approach to the use of names in a business context, then the results may be different.
However, IMHO, Googlebot is clearly associated with Google, and anyone who uses Googlebot as their user-agent is tricking site owners into believing that the request was made by Google.
As for legal grounds, Google could sue for unfair competition.
"I left my door unlocked and told my friends they could use my living room, but then they put their feet up on my coffee table... Not cool, man!" Pretty much the equivalent situation.
2) It's illegal to commit fraud by pretending to be someone else online (when, for example, using someone else's credit card or opening a bank account in their name).
3) You can be sued for things that are totally legal, such as blemishing Google's brand by impersonating them and doing things that annoy people.