
I’ve got two questions:

1. What does it look like for a page to be indexed when googlebot is not allowed to crawl it? What is shown in search results (since googlebot has not seen its content)?

2. The linked page says to avoid Disallow in robots.txt and to rely on the noindex tag. But how can I prevent googlebot from crawling all user profiles to avoid database hits, bandwidth, etc. without an entry in robots.txt? With noindex, googlebot must visit each user profile page to see that it is not supposed to be indexed.


https://developers.google.com/search/docs/crawling-indexing/...

   "Important: For the noindex rule to be effective, the page or resource must not be blocked by a robots.txt file, and it has to be otherwise accessible to the crawler. If the page is blocked by a robots.txt file or the crawler can't access the page, the crawler will never see the noindex rule, and the page can still appear in search results, for example if other pages link to it."
It's counterintuitive, but if you want a page to never appear in Google search results, you need to flag it as noindex and not block it via robots.txt.
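If the profile pages are rendered by an app, the noindex flag can also be sent as an X-Robots-Tag response header instead of a meta tag in the HTML head. A minimal sketch, assuming a Flask-style app (the /users/ route and handler are made up for illustration):

  # pip install flask
  from flask import Flask, make_response

  app = Flask(__name__)

  @app.route("/users/<username>")
  def profile(username):
      resp = make_response(f"profile page for {username}")
      # Equivalent to <meta name="robots" content="noindex"> in the HTML head.
      # Googlebot must be allowed to fetch this page (no robots.txt Disallow),
      # or it will never see this header.
      resp.headers["X-Robots-Tag"] = "noindex"
      return resp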

> 1. What does it look like for a page to be indexed when googlebot is not allowed to crawl it? What is shown in search results (since googlebot has not seen its content)?

It'll usually list the URL with a description like "No information is available for this page". This can happen, for example, if the page has a lot of backlinks, is blocked via robots.txt, and is missing the noindex flag.


> But how can I prevent googlebot from crawling all user profiles to avoid database hits...

If user profiles are noindexed, then why should you care whether Google is crawling them, when almost every other crawler out there does not obey robots.txt?

It's not in Google's interest to waste resources on non-indexable content; you are worrying far too much about it.


The developer said it’s not adult content.

https://www.reddit.com/r/SaaS/comments/1dfwg1i/comment/l8mlm...


Your unit is wrong. Giga means billion. The universe is ~14 Gyr old.


You're entirely correct, my mistake; unfortunately I can't edit the comment any more.


Mail the device to Spotify’s headquarters. Better yet, a Swedish artist should build a memorial made of these devices in front of Spotify’s headquarters.


The About page (link top right) says “Loadership is a side project of Jingcheng Chen” and links to https://chen.works/


The activity indicators look nice and the configurator is great! Bookmarked.


It’s GitHub, not HackerHub. That the story is reported on Hacker News is irrelevant.


But I would expect a different sentiment in the comments on a site called Hacker News.


That it's being scoffed at on Hacker News ought to tell you how uncreative and trifling it is. It's a script kiddie stunt not remotely worthy of being considered actual hacking.


I find the scraping explanation plausible. Some search engine bots are aggressive. With all the AI hype, I first thought of Microsoft Bing scraping Twitter at full datacenter speed to suck in more information for OpenAI.


I would believe that scraping is why they now require users to be authenticated.

But given that they now require users to be logged in, it should be computationally cheap to drop unauthenticated requests at the front door before they incur real expense.
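For example, the front-door check can be as small as a middleware that rejects requests carrying no session cookie before any application code or database work runs. A rough WSGI sketch, with the cookie name made up:

  def require_session(app, cookie_name="auth_token"):
      def middleware(environ, start_response):
          cookies = environ.get("HTTP_COOKIE", "")
          if cookie_name + "=" not in cookies:
              # Turn away unauthenticated requests before they cost anything.
              start_response("401 Unauthorized", [("Content-Type", "text/plain")])
              return [b"login required"]
          return app(environ, start_response)
      return middleware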

It'd also be cheap to just blackhole datacentre IP space.

The sort of attack that would require this level of limits is malware on tens of thousands of residential machines that can use a user's existing Twitter session cookies. I'm really skeptical that's the case.


These limits are far below standard scraping rates and are deeply affecting casual users; one has to presume that's intentional.


> I find the scraping explanation plausible.

How do you explain it suddenly becoming a problem today and not, say, during the recent World Cup, when not only would the AI scraping have been happening, but el Morko himself was crowing about how much extra traffic they were handling?


Someone with deep pockets flipped the ON switch?


The much simpler explanation is that elon lied again.


Actual human users are hitting rate limits in under 10 minutes because every Tweet loaded counts towards the rate limit. This is like setting your house on fire at the sight of a few mosquitoes.


Another case?! On New Year’s Eve, a worker got sucked into an engine and died, even though the pilots had announced they’d have one engine running and there had been safety briefings minutes prior.


What’s a cheap and reliable registrar? GoDaddy bought my registrar, and renewal prices are about to double for me.


Porkbun! porkbun.com

