A site can be crawled from any number of Googlebot IP addresses, and so blocking all except one doesn't help in throttling crawling.
If you verify the site in Webmaster Tools, we have a tool you can use to set a slower crawl rate for Googlebot, regardless of which specific IP address ends up crawling the site.
Let me know if you need more help.
Edit Detailed instructions to set a custom crawl rate:
1. Verify the site in Webmaster Tools.
2. On the site's dashboard, the left hand side menu has an entry
called Site Settings. Expand that and choose the Settings submenu.
3. The page there has a crawl rate setting (last one). It defaults to
" Let Google determine my crawl rate (recommended)". Select
"Set custom crawl rate" instead.
4. That opens up a form and choose his desired crawl rate in crawls per second.
If there is a specific problem with Googlebot, you can reach the team as follows:
1. To the right hand side of the Crawl Rate setting is a link called
"Learn More". Click that to open a yellow box.
2. In the box is a link called Report a problem with Googlebot which
will take you to form you can fill out with full details.
Instead of completely disregarding Crawl-Delay, why not support it up to a maximum value that is deemed sensible? This would prevent people from completely shooting themselves in the foot, and it would surely be better than completely disregarding it.
I would think that the number of people who (a) know how to create a valid robot.txt file, (b) have some idea of how to use the "crawl-delay" directive and (c) write a "shoot-themselves-in-the-foot" worthy error is vanishingly small.
As opposed to --for illustrative purposes-- the vanishingly small number of people who know how to block IP addresses and manage to get their site to disappear from Google's listings?
A few years ago, I did pretty much the same thing myself. Thankfully the late summer was our slow season and the site recovered pretty quickly from my bone-headed move, but the split second after I realized what I've done was bone-chilling.
I think just about everyone has thought at some point that they understood how something worked, only to have had things go pear-shaped on them.
The lesson: people are not fully knowledgeable about everything, even the smart and talented ones.
"You would not believe the sort of weird, random, ill-formed stuff that some people put up on the web: everything from tables nested to infinity and beyond, to web documents with a filetype of exe, to executables returned as text documents. In a 1996 paper titled "An Investigation of Documents from the World Wide Web," Inktomi Eric Brewer and colleagues discovered that over 40% of web pages had at least one syntax error".
We can often figure out the intent of the site owner, but mistakes do happen.
The number of webpages with HTML that's just plain wrong (and renders fine!) is staggering. I often wonder what the web would be like if web browsers threw an error upon encountering a syntax error rather than making a best effort to render.
The downside is maintainability. If your website follows the rules, you can be pretty confident that any weird behaviour you see is a problem with the browser (which is additional context you can use when googling for a solution). If your website requires browsers to quietly patch it into a working state, you have no guarantees that they'll all do it the same way and you'll probably spend a bunch of time working around the differing behaviour.
Obviously, that's not a problem if you already know exactly how different browsers will treat your code, or you're using parsing errors so elemental that they must be patched up identically for the page to work. For example, on the Google homepage, they don't escape ampersands that appear in URLs (like href="http://example.com/?foo=bar&baz=qux — the & should be &). That's a syntax error, but one that maybe 80% of the web commits, so any browser that couldn't handle it wouldn't be very useful.
It's interesting that you apparently actually checked a validator to get the error count, and yet the two things you cite as errors are not errors, have never been errors, and are not listed in the errors returned by the validator. Both opening and closing tags for the html, body, and head elements are optional, in all versions of HTML that I am aware of (outside of XHTML, which has never been seriously used on the open web as it isn't supported by IE pre-9). There is a tag reported unclosed by the validator, but that's the center tag.
Anyhow, one downside to having syntax errors might be that parsers which aren't as clever as those in web browsers, and which haven't caught up with the HTML5 parser standard, might choke on your page. This means that crawlers and other software that might try to extract semantic information (like microformat/microdata parsers) might not be able to parse your page. Google probably doesn't need to worry about this too much; there's no real benefit they get from having anyone crawl or extract information from their home page, and there is significant benefit from reducing the number of bytes as much as possible while still remaining compatible with all common web browsers.
I really wish that HTML5 would stop calling many of these problems "errors." They are really more like warnings in any other compiler. There is well-defined, sensible behavior for them specified in the standard. There is no real guesswork being made on the part of the parser, in which the user's intentions are unclear and the parser just needs to make an arbitrary choice and keep going (except for the unclosed center tag, because unclosed tags for anything but the few valid ones can indicate that someone made a mistake in authoring). Many of the "errors" are stylistic warnings, saying that you should use CSS instead of the older presentational attributes, but all of the presentational attributes are still defined and still will be indefinitely, as no one can remove support for them without breaking the web.
Considering it's Google and we're talking about almost the whole population of the Earth, the vanishingly small percentage of the entire population of the planet would still be at least hundreds of thousands of people.
I do understand the thought but I think it is not a good gesture to do. You could always cap crawl-delay at a reasonable maximum and additionally allow people to fix mistakes through the webmaster tools (eg if they told your bots to stay away for a long time but in the meantime want to revert that).
Maybe instead that hostload could be parsed from robots.txt? It sure seems like the better mechanic to tweak for load issues (while traffic/bandwidth issues are still unresolved).
Could google vary the crawling rate on each site and see what effects that has on response times, and develop an algorithm to adjust crawl speed so as not to affect site performance too much? If google starts crawling a site and notices sequential crawl requests are answered in .5s w/ .1s stddev and it starts crawling with 10 parallel connections and the answers are 2s w/ 1s stddev, clearly that's a problem because user experience for real people will be impacted. Maybe google could automatically email webmaster@ and notify them of performance issues it sees when crawling.
Another thing that might help google is for them to announce and support some meta tag that would allow site owners (or web app devs) to declare how likely a page is to change in the future. Google could store that with the page metadata and when crawling a site for updates, particularly when rate limited via webmaster tools, it could first crawl those pages most likely to have changed. Forum/discussion sites could add the meta tags to older threads (particularly once they're no longer open for comments) announcing to google that those thread pages are unlikely to change in the future. For sites with lots of old threads (or lots of pages generated from data stored in a DB and not all of which can be cached), that sort of feature would help the site during google crawls and would help google keep more recent pages up to date without crawling entire sites.
Have you considered putting a caching reverse proxy in front of the arc app to keep the backend from having to render all of the old pages?
The conspicuous lack of a "Server:" header inclines me to believe that that's probably not the case (most web servers set one indicating the server software and version). Here are the headers that HN sends out from an old post (20 days ago):
HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8
By ensuring that your pages are valid, you make it ever so much more likely that you will not have to scramble around wasting time at a most inopportune time when the new version of a browser comes out which handles your non-standards compliant tag soup differently than the current version of the browser.
So, do you want to pay the price upfront when you can plan for it or afterwards when the fix must be done immediately because customers are complaining?
Why don't you bother doing your real work right the first time? As long as there's a well defined spec, you might as well follow it instead of being creative and original when it comes to implementing standards.
We use varnish for caching and check the useragent for requests.
If the cache has a copy of an article that is a few hours old it will just give that version to Googlebot while if it thinks a human is requesting the page then it will go to the backend and fetch the latest version.
This is why I don't understand the anti-promotion crowd who think promoting oneself and building an audience is a bad thing. Having the implicit threat of an audience you can address is a major lever to getting decent service nowadays.
I'm certainly not against promoting myself where I think some attention is merited, but ultimately that kind of thing is close to a zero sum game in that the amount of 'famous' people is fairly limited.
In the lower-right corner of Google Maps there is a tiny link that says, "Edit in Google Map Maker". Click this link and you can edit Google Maps. Your edits get sent to Google and they'll approve/deny it in typically a few days.
it's a verified listing (do you know hard it is to convince the IT dept to take their automated phone system offline so i could verify a google maps listing? It was nuts and no google didn't offer the postcard method) so i don't see why i have to enter the same info again, but i did.
the listing only shows up if you type the exact name of the hospital into the search bar, which is useless.
Does that mean at Google you can manually set the system to treat different sites differently?
I don't mean the ranking but other aspects - like you guys blacklisted some domains which produce low quality content in wholesale. (I don't know if the algorithm was tweaked to detect and filter such sources or if it was a manual thing.)
Nope. As mentioned above, apparently Google thinks people will "shoot themselves in the foot" with the crawl-delay directive, while they won't with Google's special interface (which requires registering and logging in).
What more can you expect from the World's Largest Adware Company?
Seriously, one thing about Google is that they seem to really like ensuring people are logged on, preferably at all times. Fortunately recent changes to Google Apps (promoting apps user accounts to full Google accounts) has made this more complex on my side and probably degraded the level of actionable info they can get out of it.