1. What does it look like for a page to be indexed when googlebot is not allowed to crawl it? What is shown in search results (since googlebot has not seen its content)?
2. The linked page says to avoid Disallow in robots.txt and to rely on the noindex tag. But how can I prevent googlebot from crawling all user profiles to avoid database hits, bandwidth, etc. without an entry in robots.txt? With noindex, googlebot must visit each user profile page to see that it is not supposed to be indexed.
"Important: For the noindex rule to be effective, the page or resource must not be blocked by a robots.txt file, and it has to be otherwise accessible to the crawler. If the page is blocked by a robots.txt file or the crawler can't access the page, the crawler will never see the noindex rule, and the page can still appear in search results, for example if other pages link to it."
It's counterintuitive, but if you want a page to never appear in Google Search, you need to flag it as noindex and not block it via robots.txt.
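For reference, the noindex flag can go either in the page's HTML or in an HTTP response header; either form works only if Googlebot is actually allowed to fetch the page. A minimal example (markup only, adapt to your stack):

    <!-- in the page's <head>: allow crawling, but block indexing -->
    <meta name="robots" content="noindex">

    # or as an HTTP response header, e.g. for non-HTML resources
    X-Robots-Tag: noindex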
> 1. What does it look like for a page to be indexed when googlebot is not allowed to crawl it? What is shown in search results (since googlebot has not seen its content)?
It'll usually list the URL with a description like "No information is available for this page". This can happen, for example, when the page has a lot of backlinks, is blocked via robots.txt, and is missing the noindex flag.
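To make the counterintuitive combination concrete: a robots.txt like the one below (the /users/ path is just an illustrative example) stops Googlebot from fetching those pages, yet the URLs can still end up in the index purely from backlinks, shown with that "No information is available for this page" snippet:

    # robots.txt: blocks crawling, but does NOT prevent indexing
    User-agent: *
    Disallow: /users/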
That it's being scoffed at on Hacker News ought to tell you how uncreative and trifling it is. It's a script kiddie stunt, not remotely worthy of being considered actual hacking.
I find the scraping explanation plausible. Some search engine bots are aggressive. With all the AI hype, I first thought of Microsoft Bing scraping Twitter at full datacenter speed to suck in more information for OpenAI.
I would believe that scraping is why they now require users to be authenticated.
But given that they now require users to be logged in, it should be computationally cheap to drop unauthenticated requests at the front door before they incur real expense.
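As a sketch of what "drop at the front door" could look like (hypothetical Flask middleware; the cookie name and the validation step are assumptions, not Twitter's actual setup):

    # Hypothetical edge check: reject requests with no session cookie
    # before any database or backend work happens.
    from flask import Flask, request, abort

    app = Flask(__name__)

    SESSION_COOKIE = "session_token"  # placeholder name, not Twitter's

    @app.before_request
    def require_session():
        if not request.cookies.get(SESSION_COOKIE):
            # Unauthenticated: return a cheap 401 immediately.
            abort(401)
        # A real deployment would also validate the token (signature or
        # cache lookup) before letting the request through.

    @app.route("/timeline")
    def timeline():
        return "expensive, authenticated content"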
It'd also be cheap to just blackhole datacentre IP space.
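A crude sketch of that, too: match the client IP against published cloud-provider ranges before doing anything else (the CIDR blocks below are documentation ranges standing in for real ones):

    # Sketch: treat requests from known datacentre CIDR blocks as blockable.
    # The ranges here are illustrative placeholders, not a real blocklist.
    import ipaddress

    DATACENTRE_RANGES = [
        ipaddress.ip_network("203.0.113.0/24"),   # stand-in for a cloud range
        ipaddress.ip_network("198.51.100.0/24"),  # stand-in for a cloud range
    ]

    def is_datacentre_ip(addr: str) -> bool:
        ip = ipaddress.ip_address(addr)
        return any(ip in net for net in DATACENTRE_RANGES)

    # is_datacentre_ip("203.0.113.42") -> True, so the request gets dropped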
The sort of attack that would require limits this aggressive is malware on tens of thousands of residential machines reusing users' existing Twitter session cookies. I'm really skeptical that's the case.
How do you explain it suddenly being a problem today and not, say, during the recent World Cup, when not only would the AI scraping have been happening, but el Morko himself was crowing about how much extra traffic they were handling?
Actual human users are hitting rate limits in under 10 minutes because every Tweet loaded counts towards the limit. This is like setting your house on fire at the sight of a few mosquitoes.
Another case?! On New Year's Eve, a worker got sucked into an engine and died, even though the pilots had announced they'd have one engine running and there had been safety briefings minutes prior.