Google has been very clear lately (via John Mueller) about how to get pages indexed or removed from the index.
If you want to make sure a URL is not in their index then you have to 'allow' them to crawl the page in robots.txt and use a noindex meta tag on the page to stop indexing. Simply disallowing the page from being crawled in robots.txt will not keep it out of the index.
In fact, I've seen plenty of pages still rank well despite being disallowed in robots.txt. A great example of this is the keyword "backpack" in Google. You'll see the site doesn't want it indexed (it's disallowed in robots.txt), but the site still ranks well for a popular keyword.
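To make the mechanics concrete, here is a minimal sketch (the path /old-page.html and the directives shown are placeholders, not from the thread): leave the page crawlable in robots.txt and put noindex on the page itself. If you disallow it instead, the crawler never fetches the page and never sees the noindex.

    # robots.txt -- leave the page crawlable so the noindex can be seen;
    # a Disallow rule here would stop crawling but not remove it from the index
    User-agent: *
    Allow: /old-page.html

    <!-- on /old-page.html itself: -->
    <meta name="robots" content="noindex">

    # or, for non-HTML resources, the equivalent HTTP response header:
    X-Robots-Tag: noindex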
> However, you will not see any information like the meta description on these blocked URLs.
True, but that's not the only thing. If it ever was in the index, it takes forever to be removed, if it gets removed at all. Send 404 or 410, Disallow it or set it to noindex - you may get lucky or you may not. You can of course "hide it from search results", but that only works for 90 days (iirc, may be 120, something in that range). Those leftovers will typically lose rankings, but they often stay indexed, easy to spot with a site: query.
Reindexing a page is dynamic based on noteworthiness and volatility iirc, but individual links can be reindexed on the fly since the Percolator index. The 90d number was from an old system when indexes were broken into shards that had to be swapped out wholesale.
I don't mean reindexing, I mean "hiding from the index" ("Remove URLs" in GSC). It works instantly, but only for a limited time, after which it will re-appear in the index if you haven't gotten it out of the index (via 410, noindex or disallow). Since these other ways don't always work, if you're unlucky and want it to stay gone, you need to hide it again (and again and again). I've had clients that were hacked and had spammy content injected into their site and it took (literally!) years for that to get removed (we tried combinations of 404, 410, noindex and disallow).
Exactly, there is no guaranteed way to remove anything: HTTP status, meta tags, headers, and robots.txt only have advisory status. They are usually followed when a resource is first hit, but once it's in the index, "keeping the result available" seems to be a top priority. I do understand the idea - it might still be a useful result for a user - but otoh if it's 410 (or continuously 404), it won't be of any use because the content that was indexed is no longer available (especially in the case of 410).
Granted, these are edge cases; in most circumstances, 410 + 90-day hiding means pages are hidden instantly and don't resurface. These edge cases do make me take Google's official statements on how to deal with things with a grain of salt, though: bugs exist, and unless you happen to know somebody at Google there's no way to report them.
No, disallow means that you are not allowed to crawl the page. You have to crawl the page to know you cannot index it. So how does it get indexed if you never crawl it? Well, if another page that you can crawl and index points to the disallowed page as authoritative on a keyword, then it ends up in the index for that keyword, even though you never got the actual crawled content of the page.
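A small sketch of that distinction, using Python's stdlib robots.txt parser (example.com and the path are placeholders): robots.txt only answers "may I fetch this URL?", so a disallowed URL can still be indexed from the anchor text of pages that link to it.

    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()  # fetch and parse robots.txt

    # A polite crawler checks this before fetching a URL:
    allowed = rp.can_fetch("Googlebot", "https://example.com/private/page.html")
    print(allowed)
    # If this is False, the crawler skips the fetch entirely -- so it never
    # sees any noindex tag on the page, and the URL can still be indexed
    # purely from the anchor text of pages that link to it.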
Not if you buy the government first. I think they’re already a victim of their own greed though.
It’s bad enough I started using DDG for search because the results are now more relevant. Google’s advertising algorithms are designed to subtly nudge sites into paying for placement — which means there’s a “non-content” element to the search results that makes it into the user experience. I feel like there was a tipping point a year or two ago where the results just stopped being useful — The best analogy I can find is how search engines used to be in the days before AltaVista. Then AltaVista came out and the results were far more relevant (if not perfect). Google -> DDG feels like that in 2019.
That “non-content” element will only grow over time as Google seeks revenue growth — growth across all of Google’s non-advertising revenue streams combined is not enough to move the needle compared to the scale of their ad business — of which search ads are by far the most profitable. So they will further try to monetize search; it’s their cash cow, but I think a small player like DDG could easily overtake them as the quality of Google’s search results (to the end user) continues to decline.
Agreed re: DDG search quality. It's my own default and preferred choice. Google remains useful for Scholar and Books, but its relevance is rapidly declining and its SERPs are increasingly cluttered with deceptive ads.
It's like recommending a book you haven't read, and newspapers do that every day.
Basically Google finds the link in other places -> oh, that must be interesting, I'm indexing it, without even reading it. So they don't have the actual content, and just use the text from the sites that link to it.
But they do have the actual content, since they show the meta title and description, on top of what I assume is heavy NLP to drive the search engine itself.
Usually, creating a "410 Gone"[0] response for the URL and running the URL through the URL Inspection Tool [1] can help make things a bit faster. But yeah, it does take a while to get these 404s removed.
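For illustration, a minimal stdlib sketch of that setup (the paths are hypothetical): answer 410 for URLs that were removed on purpose, so crawlers get an explicit "gone for good" signal, and a plain 404 for everything else.

    from http.server import BaseHTTPRequestHandler, HTTPServer

    # Hypothetical set of URLs that have been deliberately removed.
    REMOVED = {"/old-landing-page", "/hacked-spam-page"}

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path in REMOVED:
                self.send_response(410)   # 410 Gone: removed on purpose
                self.end_headers()
                self.wfile.write(b"410 Gone")
            else:
                self.send_response(404)   # everything else: plain not-found
                self.end_headers()
                self.wfile.write(b"404 Not Found")

    if __name__ == "__main__":
        HTTPServer(("", 8000), Handler).serve_forever()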
That distinction exists in many systems. E.g. for cloud events, a 404 is treated with skepticism because it could be a race condition in provisioning or a transient issue, whereas a 410 requires data streams to be cut off.
5xx means that the server made a mistake. 4xx means that the caller made a mistake. Sending a request to a GONE URL is canonically classified as a “user” or sender error.
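A hedged sketch of how a caller might act on that classification (the URL and retry policy are invented for illustration): treat 404 as possibly transient and retry a few times, treat 410 as definitive and stop.

    import time
    import urllib.request
    from urllib.error import HTTPError

    def fetch_with_policy(url, retries=3, delay=2.0):
        for attempt in range(retries):
            try:
                with urllib.request.urlopen(url) as resp:
                    return resp.read()
            except HTTPError as e:
                if e.code == 410:
                    # 410 Gone: removed deliberately; stop and tear down
                    # anything (subscriptions, caches) tied to this URL.
                    return None
                if e.code == 404:
                    # 404 Not Found: could be a race or transient state; retry.
                    time.sleep(delay)
                    continue
                raise
        return None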