It's for this reason that I've stopped embedding microdata in the HTML I write.
Microdata only serves Google. Not my clients. Not my sites. Just Google.
Every month or so I get an e-mail from a Google bot warning me that my site's microdata is incomplete. Tough. If Google wants to use my content, then Google can pay me.
If Google wants to go back to being a search engine instead of a content thief and aggregator, then I'm on board.
I don't get why Google thinks it's acceptable to critique my site unprompted. It honestly just feels rude. They want me to do a whole bunch of micro-optimizations on a site that already works fine because it doesn't fit their standard of "high quality". I think I've gotten exactly zero clicks from Google search results, ever, and I don't particularly want any.
If it were possible to get a human's attention at Google I'd start sending my own criticism their way but of course it doesn't work like that...
On the other hand, it's surprising that you would get a notification if you had crawling disabled. Did you set this robots.txt up recently?
(Disclosure: I work at Google, commenting only for myself)
<meta name="robots" content="noindex">
<meta name="googlebot" content="noindex">
If Googlebot is not respecting robots.txt, and is crawling something it's been instructed not to crawl, let me know and I can file a bug?
(Disclosure: I work for Google but not on Search, speaking only for myself)
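For completeness: the meta-tag approach above only works for HTML pages. For other file types (PDFs, images, and so on) the equivalent signal is the X-Robots-Tag HTTP response header. A minimal sketch for Apache, assuming mod_headers is enabled (the `.pdf` pattern is just an example):

```
<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex"
</FilesMatch>
```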
How do you tell Googlebot to not crawl your site and to not index it either?
Previously, one could use the undocumented "Noindex" directive in robots.txt, but this will be disabled soon: https://webmasters.googleblog.com/2019/07/a-note-on-unsuppor...
You can specify your index preferences in Webmaster Tools. Don't know if there's a domain-wide off switch in there, but there probably is.
(Still speaking only for myself)
Thank you for your comment. I'm replying via email to send some info I'd rather not share on HN, but I'll post the same thing here, redacted. Back when I was starting my web-dev career, I ran a one-man development team at a web agency, and all of our development/pre-prod sites (the ones that had to be unauthed) had a robots.txt disallowing all bots, but they still popped up in Google. Searching some of the old domains in Google, I found an example here: http://***.***/***. Attached is an example of it showing up in a SERP along with what the robots.txt looks like (and I'm pretty sure the robots.txt has looked like that since that page was created).
In this case it's just one page that nobody will care about, and since I'm no longer working on projects that are open but "robots.txt hidden" I don't know if it's as bad as it used to be. But I regularly see pages with the "No information is available for this page" snippet whose domains have robots.txt files that disallow all bots, yet they still show up in Google.
Please let me know if I missed anything :)
The robots.txt protocol gives instructions to crawlers about how they should interact with the site. If you instead want to give instructions to indexes, use the noindex meta tag.
In this example the robots.txt has clearly told all bots not to crawl the site, but the only way to read the meta tag (or the equivalent header) is to crawl the site. So I assume that in this case Google either considers it fine to crawl URLs it has found elsewhere while ignoring robots.txt, or it assumes that pages disallowed by robots.txt are "open for indexing/linking", which would mean that any page both disallowed by robots.txt and carrying a noindex meta tag would still show up, right?
What is the intended behavior if a page is disallowed by robots.txt and still linked by another indexed page? Will it get crawled or just assumed to be okay for indexing/linking? Is there any way to tell Google not to index/link and not to crawl?
They can index without crawling: it's enough that other websites link to your site. So Googlebot can follow the rules in robots.txt to the letter and still list the URL. "noindex" is the way to stay out of Google.
Without the Noindex directive in robots.txt (which Google decided to stop supporting not long ago), this isn't solvable.
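The mechanics being debated here can be demonstrated with Python's standard-library robots.txt parser: a Disallow rule stops a compliant bot from fetching a page at all, so any noindex tag on that page can never be read. A minimal sketch (example.com and the sample rules are placeholders, not taken from the thread):

```python
# Sketch (stdlib only): how a well-behaved crawler consults robots.txt.
# Key point: Disallow stops the *fetch*, so a noindex meta tag on the
# disallowed page can never be seen by the crawler.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A compliant bot may not fetch any page on this site...
print(parser.can_fetch("Googlebot", "https://example.com/secret.html"))  # False

# ...but nothing in robots.txt says anything about indexing: a URL learned
# from an external link can still be shown as a bare "No information is
# available for this page" result.
```

This is why allowing the crawl and serving noindex (meta tag or X-Robots-Tag header) is the reliable way to stay out of the index, while Disallow alone is not.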
I have a feeling the PDF viewer triggered it, because on mobile it defaults to showing the whole page, which results in tiny text. But that's easily fixed by the user, so I prefer to leave it as is.
The rest of Google has a policy of "Engineers will probably say the wrong thing if we let them talk in public"
While I understand the problems with Google scraping content, as a user these snippets help me find what I'm searching for faster. If that's all you're optimizing for, Google is fantastic. There are certainly good arguments to be made for other models, but for search, stealing content helps. I'm not advocating stealing content, I'm just saying that it produces more useful results.
The first three results lead you to a fake Android blog telling you how you can easily root every Chinese Android device, and specifically the M89 tablet...
The real authoritative result (xda-developers) only appears in fourth position, below the fold. It tells you that if you follow the instructions given in the fake blog posts from the first two or three results, you will brick your tablet.
In a similar way, the word "cbd" (for cannabidiol) has been hijacked by dubious commercial companies through fake blog posts filling page after page of Google results telling you how great CBD is for treating every disease on earth... But there is no trace of an actual study in these results. You have to search for the less popular word "cannabidiol" to start seeing serious articles about it.
Google results can be hijacked, and Google does little about it. Maybe because the ads shown in these fake blog posts come from the Google ads network? I don't know...
But Google results have clearly deteriorated these last years, and the company's authority is no longer what it was.
OK, some people believe anything they read (especially if it confirms their existing biases), but that problem has always existed. I think Google’s occasional snippet fuck-ups are a drop in the ocean compared to the spread of false information through social networks.
But the long tail is important too. It's fixed now (yay) but for years you could search for "calories in corn" and Google would confidently present an answer 5x the true value, scraped from a site with profoundly wrong information. As Google moves to present more direct answers and fewer links, this risk increases.
It looks like they have backed off on the direct answers somewhat which is good news.
Very few new blogs and content websites are being set up.
All content is moving into apps and walled gardens. Part of the reason for that is that running a well-researched blog will never pay for your time, so it becomes a hobby thing, and most people are fine using Facebook for that.
Well it also serves Google's users, to be clear. Though I should also be clear that I don't think that justifies it, since I think it's bad for the ecosystem in more subtle ways than are expressed in immediate user satisfaction.
And if you view Google instead as a connection broker, e.g. a middle-man between publisher and consumer, then Google is destroying their own business by snubbing publishers. Assuming that Google is still making rational, intelligent decisions, it follows that Google no longer sees itself like that.
A search engine is inherently a content aggregator; the functions are inseparable.
I mean, maybe not yours specifically. But snippets are great for users in the typical case.