No, more people don't click, because they've taken the answer from your website and displayed it right in their search results.

It's for this reason that I've stopped embedding micro data in the HTML I write.

Micro data only serves Google. Not my clients. Not my sites. Just Google.

Every month or so I get an e-mail from a Google bot warning me that my site's micro data is incomplete. Tough. If Google wants to use my content, then Google can pay me.

If Google wants to go back to being a search engine instead of a content thief and aggregator, then I'm on board.

I just got one of those emails for the first time, about my personal site that's basically my resume. Apparently my text is small on mobile (it's not...) and some other crap.

I don't get why google thinks it's acceptable to critique my site without prompting. It honestly just feels rude. They want me to do a whole bunch of micro-optimizations on a site that already works fine because it doesn't fit their standard of "high quality". I think I've gotten exactly 0 clicks from Google search results ever and I don't really ever want any.

If it were possible to get a human's attention at Google I'd start sending my own criticism their way but of course it doesn't work like that...

I was curious what it was complaining about, since https://henryfjordan.com looks great to me. I tried to run it through Google's "Mobile Friendly Test" but fetching failed [1] because your robots.txt has:

    User-agent: *
    Disallow: /
This would explain why you've gotten zero clicks from Google (or I would guess anyone else's) search results!
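To make the effect of that robots.txt concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The `can_crawl` helper and the `example.com` URLs are illustrative assumptions, not anything taken from the site discussed above.

```python
from urllib.robotparser import RobotFileParser

# The two-line robots.txt quoted above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a compliant crawler may fetch `url` (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Every path is blocked for every user agent.
    print(can_crawl(ROBOTS_TXT, "Googlebot", "https://example.com/"))
    print(can_crawl(ROBOTS_TXT, "SomeOtherBot", "https://example.com/resume.pdf"))
    # An empty robots.txt blocks nothing.
    print(can_crawl("", "Googlebot", "https://example.com/"))
```

Note that this only models crawling permission; it says nothing about whether a URL can still be listed in search results.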

On the other hand, it's surprising that you would get a notification if you had crawling disabled. Did you set this robots.txt up recently?

[1] https://search.google.com/test/mobile-friendly?id=97_WUiIxx-...

(Disclosure: I work at Google, commenting only for myself)

Google seems to see robots.txt as "more what you call guidelines, than actual rules". Sites that block googlebot or all bots with robots.txt still turn up in google searches, just without a description, and are obviously still indexed.

robots.txt is a tool to control crawling, not to specify how you would like your site to be displayed (or not) in search results. If you don't want search engines to include your site, set:

    <meta name="robots" content="noindex">
while to block just Google do:

    <meta name="googlebot" content="noindex">
See https://support.google.com/webmasters/answer/93710
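As a rough sketch of how an indexer could read those tags, the following uses Python's standard `html.parser`. The class and function names are hypothetical, and this is not a description of how Googlebot actually parses pages; it only shows the general/bot-specific distinction between the two tags above.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots"> and <meta name="googlebot">."""

    def __init__(self):
        super().__init__()
        self.directives = {}  # meta name -> set of lowercase directives

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        if name in ("robots", "googlebot"):
            content = attr.get("content") or ""
            tokens = {t.strip().lower() for t in content.split(",") if t.strip()}
            self.directives.setdefault(name, set()).update(tokens)


def is_noindex(html: str, bot: str = "googlebot") -> bool:
    """True if the page asks all robots, or this specific bot, not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    general = parser.directives.get("robots", set())
    specific = parser.directives.get(bot.lower(), set())
    return "noindex" in general or "noindex" in specific


if __name__ == "__main__":
    print(is_noindex('<meta name="robots" content="noindex">'))
    print(is_noindex('<meta name="googlebot" content="noindex">', bot="bingbot"))
```

The important caveat is that a parser like this only runs on pages the crawler was allowed to fetch in the first place.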

If Googlebot is not respecting robots.txt, and is crawling something it's been instructed not to crawl, let me know and I can file a bug?

(Disclosure: I work for Google but not on Search, speaking only for myself)

But that requires that Googlebot be allowed to crawl the page in robots.txt in the first place.

How do you tell Googlebot to not crawl your site and to not index it either?

Previously, one could use the undocumented "Noindex" directive in robots.txt, but this will be disabled soon: https://webmasters.googleblog.com/2019/07/a-note-on-unsuppor...

The bot doesn't need to crawl your site for it to be indexed; it crawls other sites that link to yours.

You can specify your index preferences in Webmaster Tools. Don't know if there's a domain-wide off switch in there, but there probably is.

Using Webmaster Tools is not a good option since it requires you register with the exact company you are probably trying to not interact with.

The blog post you link has a bunch of alternatives, but I agree they're not great. If there are a lot of webmasters who want to be able to noindex through robots.txt then making the case for adding noindex to the standard would be a good next step.

(Still speaking only for myself)

Googlebot actually used to support a noindex rule in robots.txt, but they are removing it.


Yes, that was linked above. It looks like this is part of reducing support to what's in the spec?

Oops, yep. I didn't see that context.

I sent you an email, and I'm posting it here but without identifying info:


Hi Jeff,

Thank you for your comment. I'm replying via email to send some info I'd rather not share on HN, but will post the same, redacted, on HN. When I was starting my web-dev career I used to run a one-man development team at a web agency, and all our development/pre-prod sites (which had to be unauthenticated) had a robots.txt that disallowed all bots, but they still popped up in Google. Searching for some of the old domains in Google I found an example here: http://***.***/***, and attached is an example of it showing up in a SERP and what the robots.txt looks like (and I'm pretty sure that the robots.txt has looked like that since that page was created).

In this case it is just one page that nobody will care about, and since I'm not working on projects that are open but "robots.txt hidden" anymore I don't know if it is as bad as it used to be, but I regularly see pages with the "No information is available for this page" whose domains have robots.txt's that disallow all bots but still show up in Google.

Please let me know if I missed anything :)

Thanks for sending the screenshot! That site shows up with "no information is available for this page", which means that while robots.txt has disallowed bots from crawling it the page is still linked from other pages that do allow crawling.

The robots.txt protocol gives instructions to crawlers about how they should interact with the site. If you instead want to give instructions to indexes, use the noindex meta tag.

You're right, I was wrong about how to expect a "Disallow: /" to work. But isn't it sorta odd to have a protocol to control crawling (which is usually done to index) but (almost) require a compliant indexer to crawl all pages to comply with the indexing rules?

In this example the robots.txt has clearly told all bots not to crawl this site, but the only way to read the meta tag (or equivalent header) is to crawl the site. So I assume that in this case Google either assumes that it is fine to crawl URLs that it has found elsewhere while ignoring the robots.txt, or it assumes that pages disallowed by robots.txt are "open for indexing/linking", which would mean that any page both disallowed by robots.txt and carrying a noindex meta tag would still show up, right?

What is the intended behavior if a page is disallowed by robots.txt and still linked by another indexed page? Will it get crawled or just assumed to be okay for indexing/linking? Is there any way to tell Google not to index/link and not to crawl?
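The interaction being asked about can be written down as a small decision function. This is a toy model of the behavior described in this thread (crawling and indexing treated as separate steps), not a statement of Google's actual implementation; the names are made up for illustration, and it assumes the crawler has discovered the URL whenever crawling is allowed.

```python
from enum import Enum


class Listing(Enum):
    NOT_LISTED = "not listed"
    URL_ONLY = "listed without a description"
    FULL = "listed with a snippet"


def expected_listing(crawl_allowed: bool,
                     linked_from_elsewhere: bool,
                     has_noindex: bool) -> Listing:
    """Toy model: what a robots.txt-compliant search engine could show for a page."""
    if crawl_allowed:
        # The page can be fetched, so a noindex tag is visible and can be honored.
        return Listing.NOT_LISTED if has_noindex else Listing.FULL
    # Crawling is disallowed: the page body, including any noindex tag, is never read.
    if linked_from_elsewhere:
        # The URL is known only from links on other, crawlable pages.
        return Listing.URL_ONLY
    return Listing.NOT_LISTED


if __name__ == "__main__":
    # Disallowed in robots.txt, tagged noindex, but linked from another site.
    print(expected_listing(False, True, True).value)
```

Under this model, a page that is both disallowed in robots.txt and tagged noindex can still appear as a bare URL, because the tag is never seen.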

If you have a calendar where every month links to the previous and next months, a crawler can get stuck and hammer the server. That's the kind of thing robots.txt is for.
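A scoped rule for that calendar case might look like the sketch below, again checked with `urllib.robotparser`. The `/calendar/` path and the example URLs are assumptions chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep crawlers out of the endless calendar pages only.
CALENDAR_ROBOTS_TXT = """\
User-agent: *
Disallow: /calendar/
"""

parser = RobotFileParser()
parser.parse(CALENDAR_ROBOTS_TXT.splitlines())


def allowed(url: str) -> bool:
    """True if a compliant crawler may fetch `url` under the rules above."""
    return parser.can_fetch("*", url)


if __name__ == "__main__":
    print(allowed("https://example.com/calendar/2019/08"))  # month pages are blocked
    print(allowed("https://example.com/about"))             # the rest stays crawlable
```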

>"more what you call guidelines, than actual rules"

They can index without scraping. It is enough that other websites link to your site. So Googlebot follows the rules in robots.txt to the letter. "noindex" is the way to stay out of Google.

They can't read my no-index if they obey my robots.txt. Do they break the robots.txt to be able to read my no-index or do they assume my "Disallow: /" means I'm fine with them indexing/linking?

Without the noindex part of robots.txt (which google decided to ignore not so long ago) this is not solvable.

Oh, I just added that yesterday as a response to the email. Before that I was actually running Google Analytics but since I get basically 0 clicks it wasn't really useful.

I have a feeling the PDF viewer triggered it, cause on Mobile it defaults to showing the whole page which results in tiny text but that's easily fixed by the user so I prefer to leave it like that.

Yeah it's amazing how rapidly and rabidly they show up when the complaint is on one of their paid features like a Google cloud (GCE) post for them or a competitor, but nada on the other products. Well no it's actually not surprising.

Google cloud employees are encouraged to go on social media to get a feel for issues users are having and to make the product better.

The rest of Google has a policy of "Engineers will probably say the wrong thing if we let them talk in public"

Google has grown into a cancerous middleman.

> If Google wants to go back to being a search engine

While I understand the problems with Google scraping content, as a user these snippets help me find what I'm searching for faster. If that's all you're optimizing for, Google is fantastic. There are certainly good arguments to be made for other models, but for search, stealing content helps. I'm not advocating stealing content, I'm just saying that it produces more useful results.

How do you know that the content Google features is the best there is? If we stop clicking on sites and just rely on Google to provide us the content we'll go down a very slippery slope.

I don't really see how this problem is any different to 'how do we know the #1 search result is the best content there is?', if it provides you the information you want, then great, otherwise you load #2.

Google lends the weight of its authority to the answers it presents. It's one thing if Infowars says that Obama is planning a coup against Donald Trump, it's another if Google says so.

Try googling "root M89 tablet".

The first three results lead you to a fake Android blog telling you how you can easily root every Chinese Android device, and specifically the M89 tablet...

The real authoritative result (xda-developers) only appears in fourth position, out of sight. It will tell you that if you follow the instructions given in the fake blog posts from the first two or three results, you will brick your tablet.

In a similar way, the word "cbd" (for cannabidiol) has been hijacked by dubious commercial companies through fake blog posts filling page after page of Google results, telling you how great CBD is for treating every disease on earth... But there is no trace of an actual study in these results. You have to use the less popular word "cannabidiol" to start seeing some serious articles about it.

Google results can be hijacked and Google does little about it. Maybe because the ads shown in these fake blog posts are from Google's ad network? I don't know...

But Google's results have clearly deteriorated in recent years, and the company's authority is no longer what it was.

I know that sort of thing happens sometimes (Google presenting a spurious statement as a categorical answer) but those are bugs. As long as they are very rare, and fixed quickly when they occur, I don’t see them causing much harm.

OK, some people believe anything they read (especially if it confirms their existing biases), but that problem has always existed. I think Google’s occasional snippet fuck-ups are a drop in the ocean compared to the spread of false information through social networks.

There's the modern news-cycle axis, where Google can and should devote full-time engineers.

But the long tail is important too. It's fixed now (yay) but for years you could search for "calories in corn" and Google would confidently present an answer 5x the true value, scraped from a site with profoundly wrong information. As Google moves to present more direct answers and fewer links, this risk increases.

It looks like they have backed off on the direct answers somewhat which is good news.

If it undermines the websites producing the content Google is scraping by not sending through traffic then those sites may not continue to exist.

This is already happening.

Very few new blogs and content websites are being set up.

All content is moving into apps and walled gardens. Part of the reason for that is that running a well researched blog will never pay for your time, so becomes a hobby thing, and most people are fine to use Facebook for that.

> Micro data only serves Google. Not my clients. Not my sites. Just Google.

Well it also serves Google's users, to be clear. Though I should also be clear that I don't think that justifies it, since I think it's bad for the ecosystem in more subtle ways than are expressed in immediate user satisfaction.

That depends on how you define "users". If you define a website creator also as a Google user (by virtue of wanting to be found through Google), then Google is serving part of its users to the detriment of their other users.

And if you view Google instead as a connection broker, e.g. a middle-man between publisher and consumer, then Google is destroying their own business by snubbing publishers. Assuming that Google is still making rational, intelligent decisions, it follows that Google no longer sees itself like that.

Did Google ever see itself as prioritizing publishers and consumers equally? I think that’s a false premise and the parent is right; Google’s priority has always been consumer first.

> If Google wants to go back to being a search engine instead of a content thief and aggregator

A search engine is inherently a content aggregator; the functions are inseparable.

Not necessarily. Google used to be more of a link aggregator. There's a difference, as the OP proves.

Google (and virtually every other search engine) has always included content with links. What's different now (though not unique to Google, which is perhaps the most advanced at it) is that it algorithmically synthesizes content instead of merely aggregating it.

It does help your clients.

I mean, maybe not yours specifically. But snippets are great for users in the typical case.

These users are no longer his clients.
