I think I know what the problem is; we're detecting HN as a dead page. It's unclear whether this happened on the HN side or on Google's side, but I'm pinging the right people to ask whether we can get this fixed pretty quickly.
Added: Looks like HN has been blocking Googlebot, so our automated systems started to think that HN was dead. I dropped an email to PG to ask what he'd like us to do.
A site can be crawled from any number of Googlebot IP addresses, and so blocking all except one doesn't help in throttling crawling.
If you verify the site in Webmaster Tools, we have a tool you can use to set a slower crawl rate for Googlebot, regardless of which specific IP address ends up crawling the site.
Let me know if you need more help.
Edit: Detailed instructions to set a custom crawl rate:
1. Verify the site in Webmaster Tools.
2. On the site's dashboard, the left hand side menu has an entry
called Site Settings. Expand that and choose the Settings submenu.
3. The page there has a crawl rate setting (last one). It defaults to
"Let Google determine my crawl rate (recommended)". Select
"Set custom crawl rate" instead.
4. That opens up a form where you can choose your desired crawl rate in crawls per second.
If there is a specific problem with Googlebot, you can reach the team as follows:
1. To the right hand side of the Crawl Rate setting is a link called
"Learn More". Click that to open a yellow box.
2. In the box is a link called "Report a problem with Googlebot" which
will take you to a form you can fill out with full details.
Instead of completely disregarding Crawl-delay, why not honour it up to some sensible maximum value? This would prevent people from completely shooting themselves in the foot, and it would surely be better than ignoring the directive altogether.
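For reference, the directive is just a line in robots.txt; the number here is only illustrative, and as things stand Googlebot ignores it entirely while Bing and Yandex honour it:

    # Ask compliant crawlers to wait ~10 seconds between requests.
    User-agent: *
    Crawl-delay: 10

Capping it would just mean treating anything above some threshold (say 30) as that threshold, rather than discarding the whole directive.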
I would think that the number of people who (a) know how to create a valid robots.txt file, (b) have some idea of how to use the Crawl-delay directive, and (c) still write a "shoot-themselves-in-the-foot"-worthy value is vanishingly small.
As opposed to --for illustrative purposes-- the vanishingly small number of people who know how to block IP addresses and manage to get their site to disappear from Google's listings?
A few years ago, I did pretty much the same thing myself. Thankfully the late summer was our slow season and the site recovered pretty quickly from my bone-headed move, but the split second after I realized what I'd done was bone-chilling.
I think just about everyone has thought at some point that they understood how something worked, only to have had things go pear-shaped on them.
The lesson: people are not fully knowledgeable about everything, even the smart and talented ones.
"You would not believe the sort of weird, random, ill-formed stuff that some people put up on the web: everything from tables nested to infinity and beyond, to web documents with a filetype of exe, to executables returned as text documents. In a 1996 paper titled "An Investigation of Documents from the World Wide Web," Inktomi Eric Brewer and colleagues discovered that over 40% of web pages had at least one syntax error".
We can often figure out the intent of the site owner, but mistakes do happen.
The number of webpages with HTML that's just plain wrong (and renders fine!) is staggering. I often wonder what the web would be like if web browsers threw an error upon encountering a syntax error rather than making a best effort to render.
The downside is maintainability. If your website follows the rules, you can be pretty confident that any weird behaviour you see is a problem with the browser (which is additional context you can use when googling for a solution). If your website requires browsers to quietly patch it into a working state, you have no guarantees that they'll all do it the same way and you'll probably spend a bunch of time working around the differing behaviour.
Obviously, that's not a problem if you already know exactly how different browsers will treat your code, or you're relying on parsing errors so elemental that they must be patched up identically for the page to work. For example, on the Google homepage, they don't escape ampersands that appear in URLs (like href="http://example.com/?foo=bar&baz=qux" — the & should be &amp;). That's a syntax error, but one that maybe 80% of the web commits, so any browser that couldn't handle it wouldn't be very useful.
It's interesting that you apparently actually checked a validator to get the error count, and yet the two things you cite as errors are not errors, have never been errors, and are not listed in the errors returned by the validator. Both opening and closing tags for the html, body, and head elements are optional, in all versions of HTML that I am aware of (outside of XHTML, which has never been seriously used on the open web as it isn't supported by IE pre-9). There is a tag reported unclosed by the validator, but that's the center tag.
Anyhow, one downside to having syntax errors might be that parsers which aren't as clever as those in web browsers, and which haven't caught up with the HTML5 parser standard, might choke on your page. This means that crawlers and other software that might try to extract semantic information (like microformat/microdata parsers) might not be able to parse your page. Google probably doesn't need to worry about this too much; there's no real benefit they get from having anyone crawl or extract information from their home page, and there is significant benefit from reducing the number of bytes as much as possible while still remaining compatible with all common web browsers.
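To make that concrete, here's a small illustration (my own sketch, assuming the third-party html5lib package is installed) of how a spec-compliant HTML5 parser recovers from markup that a stricter parser simply rejects:

    # A forgiving HTML5 parser vs. a strict XML parser on slightly
    # malformed markup (unclosed <center>, bare ampersand in a URL).
    import xml.etree.ElementTree as ET
    import html5lib  # third-party: pip install html5lib

    markup = '<p><center>hello <a href="/?a=1&b=2">link</a>'

    # html5lib follows the HTML5 error-recovery rules, so this parses fine.
    doc = html5lib.parse(markup, namespaceHTMLElements=False)
    print(doc.find(".//a").get("href"))   # -> /?a=1&b=2

    # A strict XML parser has no recovery rules and refuses the input.
    try:
        ET.fromstring(markup)
    except ET.ParseError as err:
        print("strict parser choked:", err)

A crawler built on the strict parser gets nothing out of that page, while a browser renders it without complaint.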
I really wish that HTML5 would stop calling many of these problems "errors." They are really more like warnings in any other compiler. There is well-defined, sensible behavior for them specified in the standard. There is no real guesswork being made on the part of the parser, in which the user's intentions are unclear and the parser just needs to make an arbitrary choice and keep going (except for the unclosed center tag, because unclosed tags for anything but the few valid ones can indicate that someone made a mistake in authoring). Many of the "errors" are stylistic warnings, saying that you should use CSS instead of the older presentational attributes, but all of the presentational attributes are still defined and still will be indefinitely, as no one can remove support for them without breaking the web.
Considering it's Google and we're talking about almost the whole population of the Earth, even a vanishingly small percentage of the planet's population would still be at least hundreds of thousands of people.
I do understand the thinking, but I don't think it's a good approach. You could always cap Crawl-delay at a reasonable maximum and additionally allow people to fix mistakes through Webmaster Tools (e.g. if they told your bots to stay away for a long time but in the meantime want to revert that).
Maybe that host load could instead be read from robots.txt? It sure seems like the better mechanism to tweak for load issues (while traffic/bandwidth issues are still unresolved).
Could Google vary the crawling rate on each site, see what effect that has on response times, and develop an algorithm to adjust crawl speed so as not to affect site performance too much? If Google starts crawling a site and notices sequential crawl requests are answered in .5s w/ .1s stddev, and then it starts crawling with 10 parallel connections and the answers are 2s w/ 1s stddev, clearly that's a problem because user experience for real people will be impacted. Maybe Google could automatically email webmaster@ and notify them of performance issues it sees when crawling.
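Something like this back-of-the-envelope sketch is all I mean (the names and thresholds are made up, and obviously Google's real scheduler is nothing this simple):

    # Sketch: stretch the delay between fetches when the site slows down.
    import time
    import statistics
    import requests  # any HTTP client would do

    def crawl_politely(urls, base_delay=1.0, max_delay=60.0):
        delay = base_delay
        baseline = None          # typical response time for this site
        samples = []
        for url in urls:
            start = time.monotonic()
            requests.get(url, timeout=30)
            elapsed = time.monotonic() - start
            samples.append(elapsed)
            if baseline is None and len(samples) >= 5:
                baseline = statistics.median(samples)
            if baseline and elapsed > 3 * baseline:
                delay = min(delay * 2, max_delay)   # site looks loaded: back off
            elif baseline and elapsed < 1.5 * baseline:
                delay = max(delay / 2, base_delay)  # recovered: speed back up
            time.sleep(delay)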
Another thing that might help Google is for them to announce and support some meta tag that would allow site owners (or web app devs) to declare how likely a page is to change in the future. Google could store that with the page metadata and, when crawling a site for updates, particularly when rate-limited via Webmaster Tools, it could first crawl the pages most likely to have changed. Forum/discussion sites could add the meta tags to older threads (particularly once they're no longer open for comments), announcing to Google that those thread pages are unlikely to change in the future. For sites with lots of old threads (or lots of pages generated from data stored in a DB, not all of which can be cached), that sort of feature would help the site during Google crawls and would help Google keep more recent pages up to date without crawling entire sites.
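Concretely, I'm imagining something along these lines on each old thread page (purely hypothetical markup; no such tag exists or is supported today, and the name is made up):

    <!-- Hypothetical: declare how likely this page is to change.
         Not a real, supported tag. -->
    <meta name="expected-change-frequency" content="never">

The sitemaps protocol's changefreq/lastmod fields are the closest existing analogue, but a per-page tag wouldn't require regenerating a sitemap covering millions of old threads.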
Have you considered putting a caching reverse proxy in front of the arc app to keep the backend from having to render all of the old pages?
The conspicuous lack of a "Server:" header inclines me to believe that that's probably not the case (most web servers set one indicating the server software and version). Here are the headers that HN sends out from an old post (20 days ago):
HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8
By ensuring that your pages are valid, you make it much more likely that you will not have to scramble around, wasting time at the most inopportune moment, when a new version of a browser comes out that handles your non-standards-compliant tag soup differently than the current version does.
So, do you want to pay the price upfront when you can plan for it or afterwards when the fix must be done immediately because customers are complaining?
Why don't you bother doing your real work right the first time? As long as there's a well defined spec, you might as well follow it instead of being creative and original when it comes to implementing standards.
We use Varnish for caching and check the user agent of requests.
If the cache has a copy of an article that is a few hours old, it will just give that version to Googlebot, while if it thinks a human is requesting the page, it will go to the backend and fetch the latest version.
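Roughly, in VCL terms (this is just a minimal sketch in Varnish 4 syntax, not our actual config; the backend address and TTL are placeholders):

    vcl 4.0;

    backend default {
        .host = "127.0.0.1";
        .port = "8080";
    }

    sub vcl_recv {
        # Crawlers may be served from cache; humans always hit the backend.
        if (req.http.User-Agent ~ "Googlebot") {
            return (hash);
        }
        return (pass);
    }

    sub vcl_backend_response {
        # Keep rendered pages around for a few hours on the crawler path.
        set beresp.ttl = 4h;
    }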
This is why I don't understand the anti-promotion crowd who think promoting oneself and building an audience is a bad thing. Having the implicit threat of an audience you can address is a major lever to getting decent service nowadays.
I'm certainly not against promoting myself where I think some attention is merited, but ultimately that kind of thing is close to a zero-sum game, in that the number of 'famous' people is fairly limited.
In the lower-right corner of Google Maps there is a tiny link that says, "Edit in Google Map Maker". Click this link and you can edit Google Maps. Your edits get sent to Google and they'll approve/deny it in typically a few days.
It's a verified listing (do you know how hard it is to convince the IT dept to take their automated phone system offline so I could verify a Google Maps listing? It was nuts, and no, Google didn't offer the postcard method), so I don't see why I have to enter the same info again, but I did.
The listing only shows up if you type the exact name of the hospital into the search bar, which is useless.
Does that mean at Google you can manually set the system to treat different sites differently?
I don't mean the ranking but other aspects - like you guys blacklisted some domains which produce low-quality content wholesale. (I don't know if the algorithm was tweaked to detect and filter such sources or if it was a manual thing.)
Nope. As mentioned above, apparently Google thinks people will "shoot themselves in the foot" with the crawl-delay directive, while they won't with Google's special interface (which requires registering and logging in).
What more can you expect from the World's Largest Adware Company?
Seriously, one thing about Google is that they seem to really like ensuring people are logged in, preferably at all times. Fortunately, recent changes to Google Apps (promoting Apps user accounts to full Google accounts) have made this more complex on my side and probably degraded the level of actionable info they can get out of it.
Seeing threads like this reminds me that HN is still a pretty tight-knit community of real people doing real things.
It's good to see this stuff sometimes. Thanks, Matt!
And then reading some of the other threads on this topic is a bit...something.
Guys, can you calm the conspiracy theory nonsense a bit? Please?
If you're not on this site very much, you might not realize that Matt pops into almost every thread where Google is doing something strange, regardless of who they're doing it to, and tries to help figure out what is happening. This isn't HN getting some sort of preferential treatment; this is just the effect of having a userbase full of hackers.
You'd see the same type of thing on /. years ago if you frequented it enough.
This is nothing new. This is what a good community looks like. Everybody relax.
Honestly if you read the things that Matt and Pierre have said, they just looked at "freshness" (I believe that is what it is called), and inferred that PG had blocked their crawlers.
This is all stuff you can get from within google webmaster tools (which isn't some secret whoooo insider google thing. It's something they offer to everybody, and it's just like analytics.)
OH! Wait! I mean (hold on, let me spin up my google conspiracy theory generator): thehackernews.com has more ads on it so google is intentionally tweaking their algo to serve that page at a higher point than the real HN because of ads!
C'mon, guys, look at their user pages. They're both just active users of the site trying to help out.
"This isn't HN getting some sort of preferential treatment, this is just the effect of having a userbase full of hackers"
Of course it's preferential treatment. And if you scan the last month or two of Matt's comments they are general in nature and not specific as in:
"I think I know what the problem is; we're detecting HN as a dead page. It's unclear whether this happened on the HN side or on Google's side, but I'm pinging the right people to ask whether we can get this fixed pretty quickly."
You don't think "pinging the right people" and "get this fixed pretty quickly" is preferential treatment?
He's done the same thing for nearly everyone that's asked about something Google-related here. So yes, everyone on HN gets special treatment.
Answering people's individual questions doesn't scale to the entire Internet, so Google really has no choice but to address problems on a case-by-case basis. In this case, Matt reads HN and personally wants to solve the problem. That's the only way Google could possibly work, so that's how they do it.
The preferential treatment is kind of annoying TBH. I've seen sites disappear from Google's listing for months as a result of simple issues such as a change of domain name (with all the required redirects in place).
Yet here, a website owner is purposely blocking the crawler and they jump with solutions to try to fix the problem. Sigh.
All these years, you thought you were working with machines, and finally it turns out coding is a "people business". Did they change, or did you? Is quality less important than networking now because social networks make quality the equivalent of good networking? Or because of some corruption in the pure bootstrap capitalism of code=web=money=power?
I don't know the answers to these questions, but I am concerned about being told to accept the output of raw logic and algorithms as holy writ, only being told a bit later that everything's negotiable if you have a personal relationship.
It really is, but I was personally kind of glad not to see HN have that high a ranking on Google for the term.
I remember when I first started visiting HN, I saw all these smart people and the tight community, and I was amazed that something that felt so close-knit and exclusive, yet was still open, could exist these days.
I was a lurker for a long time before I actually signed up and participated because I honestly felt like I wasn't entitled to be part of "the group" and should somehow earn my wings. Then in late 2010 I signed up but didn't submit for a bit and didn't join discussions. I still felt like I didn't have enough to offer. I now feel like I've somehow earned the right to be part of this community, though in hindsight I'm quite embarrassed by my first few submissions.
So this story does have a point that I'm about to get to. I first heard of HN through an article in GQ and then forgot the link. I couldn't find the site again after searching Google for "Hacker News" as easily as I thought. This frustrated me slightly back then but now I think it's a good thing.
As the size of a community gets larger the quality of comments and submissions usually decreases. Letting people join HN freely and openly is a great thing but I fear that if it became a huge sensation then we'd be inundated by garbage submissions and comments way more frequently. I know about the post on how newbies often say HN is becoming Reddit and all that so I do try to remember that.
So the point is that not everyone respects communities like this, or is as thoughtful about joining and how they choose to interact on communities like HN as I was, and I feel like maybe it's okay if Google isn't giving us the best ranking for certain terms. I mean, HN is still easy to find, just not that easy to stumble over.
We do include some information in the search results for cases where our users could be harmed, e.g. if we detect malware or if we think that your site may have been hacked. We've been exploring other ways to alert site owners that there's something important affecting their site.
But people do choose to remove their sites, so we can't always tell the difference between a mistake and someone who genuinely prefers not to be in Google's index.
Excellent contribution... Ugh, this is why I don't mind if HN is listed on Google or not.
So which part is laughable? The part where they're able to crawl the vast expanses of the web and return relevant results for the majority of their users? Or is it the part where they came out of nowhere to dominate search because they did it better than the rest?
Come on now, you can't be all things to all people. Google is far from perfect, but for a lot of us it's much closer to perfect than the competition, and they're constantly trying to improve it. Why don't you go ask Matt Cutts to fix whichever parts of it you think are laughable to your liking? He's been hanging around here and he doesn't seem shy about answering people's questions and concerns. I do doubt he'd give the time of day to a one-sentence remark that adds nothing of value whatsoever to the larger discussion or any of its offshoots.
No, the part where instead of getting better, Google search keeps getting worse. The verbatim mode is finally there but much of the time when it mucks with my search terms it's wrong. The UI is also cluttered. They are doing a lot of things that do nothing to improve the quality of the search, so when I want to learn something, I often skip Google and go straight to Stack Overflow or Wikipedia. I did jump the gun on this issue, though...turns out some googlebot blocking was involved.
Does anyone else find it disturbing that Google employees are bending over for pg/HN? Seriously, if any other webmaster blocked Google's bots, they wouldn't change their algorithms to accommodate, or see how they could use less of our resources.
It's pg's fault, not Google's, and I don't see why they should care. Maybe from their standpoint it would be more beneficial to Google users who are used to typing in 'hacker news' to visit this site, but since when did that matter to Google?
Also, don't get me wrong, I love both Google and Hacker News. I just find what's going on in this thread interesting.
From my personal point of view, I help webmasters when I see an issue. It's part of my job and it's something that gets me fired up. I help when I go on our forums, I do that here, and I'm regularly helping webmasters on Google+ (my team does regular webmaster hangouts that anyone can join - see my profile: https://plus.google.com/115984868678744352358 ) and also on Twitter. We can't cover everything, we can't be everywhere, but when we come across something we can help with, we try to help!
Another important point about this thread: this is a very common issue that regularly comes up, and no site is immune from it. I regularly see major sites have a firewall that auto-configures itself to block Googlebot, and the webmaster doesn't know what's going on. Raising awareness about this problem and how to fix it adds to the importance of replying.
It's not disturbing at all. I am glad Google is a company made up of humans and not uncaring robots. Human nature is that people want to help others when they are in a position to do so. I would do the same if HN needed my help.
Of course, there are people who want to take advantage of people like Matt to give themselves an economic advantage, and I have no sympathy for those people if they can't get someone in Google to help them as easily.
Actually, much as I've blasted Google over the years for not doing anything (like abuse-desk staffing) that requires People Not Algorithms...
Giving occasional personalized help is a great way to avoid a total black hole while still creating an escalation path. It's like having EXPLAIN QUERY stack traces on NewRelic. You don't need a deep trace of every transaction; you need a deep trace from a few representative squeaky wheels, and you use that information to improve the underlying self-serve process. This is something Google, with their "1/3 of the company parses logs", excels at.
You'll note neither Matt nor Pierre offered to do anything special for PG. They didn't say "Oh! I love HN, and we want it to succeed; I'll go push out a manual override that makes sure we use only the one crawler you unblocked." They gave him exactly the info he'd get from Webmaster Tools plus general knowledge, and told him how to use Webmaster Tools to fix it. Maybe they'll go figure out how to implement that "you're blocked!" hover-over in SERPs, but again, that would benefit everyone, not just HN. There's no special treatment here.
To be fair, I've seen Cutts jump in on "no-name" threads where there is an issue with Googlebot or some search-related thing. It does not appear to be favoritism, beyond choice of forums in which to participate.
Fairness is what it's all about. I like your choice of wording. A lot of people have a problem with Matt Cutts jumping in because they feel it's not fair. They're mistaking equality for fairness. This is perfectly fair but not equal. Equality sucks; fairness is better. Fairness is treatment based on merit, and equality is just equal treatment no matter what the situation. Equality can really screw you if you're on the wrong side of it.
I see your point and maybe my attitude toward it is totally biased or morally/ethically off balance but I think it's just fine.
PG knows his shit. He knows how to block crawlers, he knows how to get SEO done, he's no dummy. "So what?" you say. Well, I work in web development and I meet with people all the time and explain to them how to run a site, get rankings up, etc. I tell them I'll give them the tools, show them how to use them, and even leave detailed manuals on how to work everything. These people are so into it at our first meeting, and they're excited to get into it and work this stuff.
Then the job is done, the tools are handed over, and the lessons are learned. What do they do? Nothing. Then they do nothing. Then even more nothing. The site is never updated and they do no maintenance whatsoever and eventually it becomes outdated and dead for all intents and purposes. A year later they want to know why they're not ranking in Google. I ask if they've checked or even once used any of the tools I gave them and the answer is always no. Then they demand I fix it. I kindly remind them they're the ones who wanted to have Larry in accounting run the site because he's "good at computers" and to save money by not hiring an IT guy.
So the point I'm making is that I'm sure there are a ton of people out there who are trying to get customer service without trying to help themselves or using the resources readily available. I think it's great that Google is showing they're willing to help those who help themselves.
I think google isn't treating everyone equally but they are treating people fairly. There's a difference. I'd prefer fairness. Not all "customers" are created equal so you compensate by treating them with fairness instead.
It seems he doesn't want to. From what I understand, there's the possibility that he'd prefer to accomplish his goal without the use of Webmaster Tools. Like I said, he knows his shit, so he's going to do things on his terms if he can. Webmaster Tools might be what the average guy wants to use, but then there are some who follow the beat of their own drummer whenever possible, for whatever reasons we may never know.
This fiasco is not good evidence for pg knowing his shit. If he wants to block bots and he doesn't want to use the tools provided to make that work, maybe HN should be #10. It's even behind the wikipedia article describing it. Which is actually not bad for a dead site.
I know about that and it's on purpose. There's a number of reasons I want it that way for now. Way to be snarky for no particular reason. I didn't know offering up a sincere comment about PG roused so much emotion.
Actually, I wasn't being snarky. Stuff like this happens, and many times the owner of the site doesn't know it, since (for example) they are using a locally cached copy in their browser, or the DNS is cached, so they are the last to find out. I was just pointing it out. And the only reason I saw it was that I read your comment. (Really.) I see you've now removed the domain name from your profile.
Yeah, I did. Sorry, I took that wrong. That used to be a working domain when I first had it in the profile. I just pointed the name servers to another hosting account but I'm not quite ready to do anything with it which is why it was dead. Thanks for reminding me it was in my profile though. I did forget that.
To be honest, I disagree. The fact is that people who know their shit shoot themselves in the foot occasionally. That's how they come by the shit they know. Timid people rarely know their shit because they are too afraid to make mistakes.
Being bold is an important aspect of getting to know your shit.
Besides, this action has gotten us talking about how Google throttles crawling, and this discussion has further reduced my opinion of the World's Largest Adware Vendor.
Now, if I were to ban Google searchbots based on what PG is saying, I would simply limit them to the first page or two of listings. I think this can be done in robots.txt?
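Partially, yes: Googlebot understands Allow alongside Disallow (and the $ end-of-URL anchor), so an illustrative sketch like the following would expose only the front page. Deeper "More" pages on HN don't have stable URLs, so "page two" is harder to express:

    User-agent: Googlebot
    Allow: /$
    Allow: /news$
    Disallow: /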
Yes, it's VERY disturbing, no question about it. Matt can sound all helpful, and cuddly, and oxytocin-inducing all he wants, but he's basically stepping in and helping 1 site over another, when the whole process should NOT favor anyone. There are lots of webmasters out there that do stupid stuff like override a robots.txt accidentally, yet I don't see Matt sending them an email, asking them to check up on it.. Come'n... this is lame!
What do you think should happen in this scenario? Matt Cutts is aware that a reasonably high-profile site is not being indexed properly. Should he just ignore that?
I think it's particularly good that this is being done publicly, too. There's billions of sites out there and Google can't provide this for all of them, but since they're doing it publicly others can learn from it and know how to address it.
If Matt was artificially boosting HN's ranking that would be disturbing, but they're fixing an indexing issue.
Are you kidding? Matt is basically a superhero who helps anyone he can find. He's solved problems for people complaining on Twitter, on Google+, he hosts regular video office hours when anyone can come to him with problems, etc. It may be the first time YOU'VE seen him rush in to help someone, but this is Matt Cutts' standard MO.
No, he's not a super hero.. and no he doesn't help everyone that asks for it. I've brought up many issues over the years to him via Twitter, blog comments, emails, etc, and none of them.. count them.. ZERO have been addressed by him. He just gives preferential treatment to those sites that make a lot of noise, and might give Google or Google's quality team a lot of negative press. That's all he does. I've even brought up 100% apparent, clear spammy sites and he doesn't respond, and doesn't do anything about them. They're still there.
Lots of folks here like to give Matt Cutts this aura of an angel, or some sort of saint. Those who have known his actions over the past 5-10 years know better. Just ask guys like Aaron Wall and Rand. Matt Cutts is just a pawn that is there to do damage control for Google. That's all he is.
Hi AznHisoka. A lot of people send me spam reports, and I'm happy to pass them on. But normally we investigate the spam reports without sending back specific feedback of what action we took in response.
> normally we investigate the spam reports without sending back specific feedback of what action we took in response
I understand why you have to do this, but I wish there was at least some transparency regarding confirmation that an issue has at least been looked at, as opposed to just filed as a spam report.
I've only reported one issue to Google before, and the site in question, though incredibly obviously bad, is still the top link on the SERP for the business.
Specifically, there's a pizza place here in Halifax, NS that looks like it didn't renew its domain in time, and some people are squatting on it with an old copy of the site plus their own spammy links. The site displays a copy of the site from 2008 or so, with a banner at the top which reads:
> To previous domain owner: We bought this domain after expiration so it's not our fault that you lost it. We put old content for this domain only to avoid losing good quality of it from SEO point of view. If it's a problem for you contact us ASAP!
I reported it to Google months ago and never heard a word about it, but today when I search for the business name - which I'd imagine a lot of people do when looking for menus, phone numbers, etc. - the "compromised" site still ranks first on the SERP. I wonder how many people even notice the banner telling customers that <jedi>this is not the page you're looking for</jedi>.
If that were truly their goal, their ads would never be useful: if the ads were the best results, they would be duplicates of the organic search results (hence useless), and if they weren't the best results, then the best results would already be in the organic search.
Clearly that's not the case, as advertising makes 99% of their revenues.
That's exactly my point. I was responding to the parent's suggestion that "Google should simply deliver the best results, no matter what algorithm they use to do so." In the scenario you describe, the best results would be the stores (for example) where you would buy those shoes.
Remember, organic doesn't have to mean non-commercial. In fact, if I search for best buy, the first organic result is bestbuy.com
It is obvious that the best organic result for a nav query like [best buy] is bestbuy.com. The issue is with queries like [some product], should it show amazon first? or bestbuy? or some local reseller?
Of course that's the point. Again, I was responding to the parent ("Google should simply deliver the best results, no matter what algorithm they use to do so"), which I understood to be what results Google should choose to show. Perhaps I misunderstood his meaning for "best", but "best" in this context to me means the result that's most likely to satisfy the user.
I'm not disagreeing with the fact that ads are subject to a whole different objective function. I'm saying that if the best results were already in the organic, then the ads would either be redundant (for the user) or otherwise less useful than the organic (by definition, really).
All the people in this thread, including Google-related users, are trying to do is figure out what changed. That could just as easily have been at Google's end or at HN's! It could also just as easily have been a correct change.
If it turns out it was something HN changed, and they did not use, or were not even aware of, the proper Google tools available to everyone to optimize their site, then I hardly think any reasonable person could object to it being pointed out to them - even if it's by Google personnel.
Interesting, but not disturbing. I think it's just a consequence of people at Google initially reading HN for the stories, then finding that the HN community forms a useful 'brains trust'. I for one am glad to see that HN has a hotline to the Kremlin ;)
Baseless comment. If they didn't help, you would say: "HN is one of the top sources of tech news in the world, and it is surprising no one from Google is responding to such an obvious mistake." Since they did help, you comment as stated.
Secondly, HN is one of the most popular sources of tech news. By responding to this issue, Google is helping bring out possible solutions to a common problem - that crawlers are too frequent and affect the performance of a site. People like you and me can read the responses and learn what can be done to control this.
Third, popular websites like HN deserve attention like this because of the following they have garnered over the years. If, say, "hahla blog" doesn't show as a top result in Google, it will take time for Google to verify whether it should even be shown as a top site. But if something like HN or Amazon.com doesn't show as the top result when users explicitly search for it, that is a darn good use case for Google to fix the issue. It is an indication that something is definitely wrong somewhere and needs to be fixed.
What's more interesting is that people ONLY feel the way you do when the company in question is big. If it's a tiny startup doing some sort of unscalable marketing or customer support, nobody blinks an eye. But if a Google-sized organization does it, it's shocking or disturbing.
I think your best bet is to realize that, regardless of size, companies are always made up of people. And people are opinionated, subjective, and prone to making decisions that fall outside of some standard set of rules.
Let's be real here. Do 99.9999...% of users need or deserve this kind of support? It's easy to have sour grapes and say "oh the big shots get attention but I don't! Not fair!". But then you have to remember that people like PG worked their ass off to get a site like this to become so valuable and popular. We're not talking about some guy's blog about his cat that no one cares about, we're talking about HN here. Is it equal treatment? Of course not but is it fair? I say it is fair.
If fairness is treating people based on merit then it's totally fine. Matt isn't in here saying he's going to manually manipulate the result or change anything on Google's end (save for maybe correcting a flaw that would benefit everyone who's in a similar situation as HN). All he's doing is suggesting possible causes, trying to make a diagnosis, and just troubleshooting.
Most people who want this attention don't deserve it. They're the type of people who won't use the resources available, like the Google help documents, or even learn how to administer a site for best results in search engines. Plus, most people's websites just aren't worthy of this attention.
I could go on, but let's look at the reality here. We're all knowledgeable about how the web works, so let's just admit that HN is a damn popular site, and it definitely seems like a fluke to have it rank as it did when this was submitted.
Was the query for a domain name or just hacker+news? How do we know?
Does <title> make any difference?
I don't think of this site as "Hacker News". I think of it as ycombinator, and the subdomain, news.
Should users of hackernews.com think of that site as something else, e.g. whatever is between the title tags?
A searchable list of domain names, ranked by popularity. Or even a searchable list of main page titles. Is that how some users are using Google? If so, Google does not need a full, current cached copy of the crawlable web to provide that.
Unsure. On several occasions I got the distinct impression that a lot of members aren't actually interested in "hacking" (for any of its definitions); instead you get people asking about the value of jailbreaking a phone: http://news.ycombinator.com/item?id=3169000
I suppose it is because the top hits below HN are perhaps even less about "hacker news" ?
Why is it that "our algorithms think the site is dead" so soon, yet when I search for a keyword, I still get sites that are dead or no longer contain the given keyword? Bad algorithms, or is this "thinking dead" a time-delayed thing?
Normally there's a lag between when a site goes dead and when we crawl the site to see that it's dead.
It's also tricky because you don't want a single transient glitch to cause a site to be removed from Google, so normally our systems try to give a little wiggle room in case individual sites are just under unusual load or the site is down only temporarily.
Not a webmaster, but would like to know:
Does/can the google crawler system send a notification email to the registered webmaster of a website, if something like this happens? This could be deemed unsolicited, but would in a huge majority of the cases be more than welcome!
All these errors are reported in Webmaster Tools in the Diagnostics section. Verify your site and you'll have access to this and a wealth of more data.
Also, we do send notifications in Webmaster Tools, and you can configure those to be delivered by email too. I'm not sure if we send messages for these kinds of serious crawl errors, so I'll need to check. If not, that's an interesting idea I can ask the team to think about.
Has Google been making significant changes to the search ranking algorithms in the past couple months? I've noticed a significant decline in the quality of results, to the point that (for the first time in years), I've bounced over to DuckDuckGo or Bing to try my luck there. I love Google as a company, so (if this isn't just in my head) I'd love to see things get better again.
EDIT: Looks like the change to HN's ranking is related to a change that pg made, so my comment is now less relevant to the parent post. I still stand by it, though. :-)
If you can look back in your search history to find the specific searches that didn't do well for you, we're always happy to get really concrete examples.
The best reports look like "I did a search for [Bavarian red widgets] and the results weren't good because e.g. you were missing a specific page X that you used to return or should return, or you returned Austrian red widgets" or whatever.
Lots of Googlers are clearly hanging out on HN over Thanksgiving while they're stuck at relatives' houses. :)
I'm in Houston, TX. When I searched for 'windshield repair houston', the 1st result - wwwDOThoustonwindshieldrepairDOTnet - looked promising so I made an appointment with them to get my windshield repaired. When I went to their place of business, it was just a guy in a pickup truck in the parking lot of a strip mall, with a 'windshield repair' sign on the back of the truck.
Turns out he's running a scam where he gets people to file claims with their insurance, and when they pay him, he kicks back 50% to the customer. Having insurance pay for damages is common enough; plenty of businesses do it. But this guy was trying to get me to file claims for damage that I didn't even have. When I asked for a cash price just to repair the damage I did have, he refused, saying that 'it wasn't enough money'.
You're welcome. I did a little more digging and wwwDOTpatscowindshieldrepairDOTcom is a mirror site with the same info that's ranking for the same terms, and the backlink profile for this site also shows lots of paid backlinks.
That's helpful, thanks. I haven't kept a record of those searches (they were mostly programming-related queries), but I'd be happy to submit reports in the future. Is there a URL or email address for this purpose?
This is awesome. Matt friggin Cutts is over here personally responding to not just the big shots (as in PG, and I don't mean that in a bad way at all) but some of us little people too. I see this stuff and it just inspires me. I really hope one day I can be in a position where all types of people, rich, poor, powerful, powerless, can all benefit from my skill or knowledge and that I would be the guy who helps the average Joe the same as the hot shot CEO. I hope I carry that with me as I build my own business. Thanks for being so cool, Matt Cutts.
I disagree. I mean, we don't want to keep it a secret, but at the same time this site has always been really close-knit and the people all really just "get it". If HN grows too large, I fear for it. The web is full of cretins and people who think their thoughts are so worthy they must be heard. Lots of ignorance out there. I've seen some of it here already. I learned the culture and studied the etiquette and what qualifies as a useful submission or comment before I really joined in. Not everyone is so thoughtful, and it could really sour a once-great community. You're obviously new around here, so I'm not surprised you'd say that. I mean no disrespect - I do think you'll eventually change your mind though.
Out of curiosity, how did the OP even find out that this was going on? He appears to be a long-time user of the site, and therefore would have no reason to Google "Hacker News".
The most common other reason would be that some people use Google as their URL bar - instead of typing "hackerne.ws" or "news.ycombinator.com" into the URL bar, they type "Hacker News" into Google and click on the first result. However, I would've thought that the types of people using HN would have the tech savvy to use a keyword bookmark, or at least the URL.
Sometimes I want to find HN but am using a different device, sometimes that device is also a touchscreen and typing is a pain. Usually Google is very close to hand and the autocomplete saves typing out a full address or full set of keywords.
So I'd only need to type "hacke" into Google and click "I'm feeling lucky" in the drop down rather than type "http://news.ycombinator.com/" into an address bar.
Didn't know about the hackerne.ws domain name however.
A few months ago, I was talking about the Stanford ai/ml-classes to one of my friends. He asked me if there were any more classes and I said "Yes, take a look at the front page in Hacker News, there are some links. I visit that site frequently, it's very helpful."
There was one little issue though.
The poor guy didn't know what Hacker News was, so he found that site (hackernews.com). He then scanned the Twitter stream on that site several times to find those links and kept visiting the site for several days to find those other helpful links.
When I saw him again a few days later, he told me: "What a silly site HackerNews is! And I couldn't find the links to those classes over there."
He also told me that he was disappointed in me for visiting such a silly site.
Now, can you guess the look on his face when I told him that he was visiting the wrong site for the last few days?
DDG is using Bing for most general web results.
My site mysteriously disappeared from Bing and DDG a few days ago at the same time; but Gabriel said if I'm not cool enough to get a top ranking on HN for six hours to complain about it, I'll just have to go through the normal process Bing sets up for webmasters.