Google employee explains how they use sitemaps and how we can benefit from them (stackexchange.com)
186 points by stymiee on May 30, 2012 | 23 comments

I'd love to see how people approach generating their sitemaps.

I was doing it by batch, but once I went over a few million URLs I realised how much effort that takes. And in a farm of cloud web servers I had one server doing this big batch job and then syncing the others... effectively a master.

So I scrapped that and went to sitemaps generated dynamically upon request. But the problem I found with this was content deletion changing the URLs in the higher numbered sitemap files... i.e. when content in the first few files gets deleted, the URLs in all subsequent files shift slightly to fill the gaps. Because Google and others may only take some sitemaps one day, and some the next... you risk appearing to have duplicate info in your sitemaps... I prefer long-cacheable sitemaps anyway... the URLs in file #23 should always be in #23 and not another file.

So I'm moving towards dynamic generation based on a database table that stores all possible URLs and will associate batches of 20,000 URLs per sitemap file... if I delete content referenced by sitemap #1, then that now has 19,999 URLs and sitemap #2 remains at 20,000 URLs. A second benefit of such a table is that I can use a flag to indicate whether the content has been deleted and use that to determine whether to 404 or 410 when that URL is accessed.

If anyone feels that they have a better way of doing this, I'd love to know it.

Ideally, it would be non-batch generated, and strongly associate a URL to a given sitemap file.
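
For anyone who wants to try the table-based idea from this comment, here is a minimal sketch. It assumes SQLite and a table called `urls`; the function names are made up for illustration and are not from the comment. The point it shows is that a URL's sitemap number comes from its row id, which never changes, so deleting a row shrinks one file and leaves every other file alone. The same `deleted` flag decides between 404 and 410.

```python
import sqlite3

URLS_PER_SITEMAP = 20_000  # batch size mentioned in the comment


def open_db(path=":memory:"):
    """Create the (hypothetical) urls table if it does not exist yet."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS urls ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"  # AUTOINCREMENT: ids are never reused
        " loc TEXT UNIQUE NOT NULL,"
        " deleted INTEGER NOT NULL DEFAULT 0)"
    )
    return db


def add_url(db, loc):
    """Register a URL and return its permanent row id."""
    return db.execute("INSERT INTO urls (loc) VALUES (?)", (loc,)).lastrowid


def sitemap_number(row_id, per_file=URLS_PER_SITEMAP):
    """Ids 1..per_file map to sitemap 1, the next per_file ids to sitemap 2, etc."""
    return (row_id - 1) // per_file + 1


def mark_deleted(db, loc):
    """Flag the URL instead of removing the row, so its slot is never refilled."""
    db.execute("UPDATE urls SET deleted = 1 WHERE loc = ?", (loc,))


def sitemap_urls(db, number, per_file=URLS_PER_SITEMAP):
    """Live URLs belonging to sitemap file `number`."""
    first, last = (number - 1) * per_file + 1, number * per_file
    rows = db.execute(
        "SELECT loc FROM urls WHERE id BETWEEN ? AND ? AND deleted = 0 ORDER BY id",
        (first, last),
    )
    return [loc for (loc,) in rows]


def status_for(db, loc):
    """410 for content we know was deleted, 404 for URLs we never had."""
    row = db.execute("SELECT deleted FROM urls WHERE loc = ?", (loc,)).fetchone()
    if row is None:
        return 404
    return 410 if row[0] else 200
```

Because each file is just a range query on the id, it can be generated on request and cached for a long time; only the file that contains a changed row needs to be invalidated.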

i have done a shitload of different sitemap.xml logic over time; the most advanced (which is still in use by a top 500 worldwide website) is a real time sitemap. if a new dataset gets added, it is calculated whether the dataset will result in new landingpages or will lead to an update of an existing landingpage. if the new/updated landingpages are within a content quality range, the "shelving logic" looks up which shelf (read: sitemap.xml) they belong on. it then sets the various last-modified dates (which bubble up from the landingpages to the sitemap.xml to the sitemap-index to the robots.txt)

we achieved a 97% crawl to landingpage ratio and an overall 90%+ indexing ratio on sites with 35M+ pages (after a lot of trial and error with the quality metric).

that said, for sites with less than 10M pages i do not care anymore: just submit big complete sitemaps and update them when a sitemap gets added (or a big bunch of them deleted). the overhead of running and maintaining a real time sitemap for small websites (less than 10M pages) is just too much.
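
A rough sketch of the "bubbling up" part of this comment, under my own assumptions: the `Shelves` class and its method names are invented, the quality check is left out, and everything is kept in memory. It only shows how touching one landing page can update the last-modified date of its shelf (sitemap file) and of the sitemap index in one step. How the date reaches robots.txt is not specified in the comment, so that is not modelled here.

```python
from datetime import datetime, timezone
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


class Shelves:
    """Hypothetical in-memory model: shelf = one sitemap.xml file."""

    def __init__(self):
        self.pages = {}      # shelf name -> {url: lastmod}
        self.shelf_mod = {}  # shelf name -> newest lastmod on that shelf
        self.index_mod = None  # newest lastmod across all shelves

    def touch(self, shelf, url, when):
        """Record a new or updated landing page and bubble its date upwards."""
        self.pages.setdefault(shelf, {})[url] = when
        prev = self.shelf_mod.get(shelf)
        if prev is None or when > prev:
            self.shelf_mod[shelf] = when
        if self.index_mod is None or when > self.index_mod:
            self.index_mod = when

    def index_xml(self, base_url):
        """Render a sitemap index with one <lastmod> per shelf."""
        root = ET.Element("sitemapindex", xmlns=NS)
        for shelf in sorted(self.shelf_mod):
            entry = ET.SubElement(root, "sitemap")
            ET.SubElement(entry, "loc").text = f"{base_url}/{shelf}"
            ET.SubElement(entry, "lastmod").text = self.shelf_mod[shelf].isoformat()
        return ET.tostring(root, encoding="unicode")


shelves = Shelves()
shelves.touch("sitemap-1.xml", "https://example.com/a",
              datetime(2012, 5, 1, tzinfo=timezone.utc))
shelves.touch("sitemap-2.xml", "https://example.com/b",
              datetime(2012, 5, 30, tzinfo=timezone.utc))
print(shelves.index_xml("https://example.com"))
```

With this shape, a crawler that reads the index only needs to refetch the shelves whose `<lastmod>` moved.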

You might want to read my answer to this question on Quora: http://www.quora.com/If-I-have-a-website-with-millions-of-un...

There I explain how we did this at eBay when I was working there. eBay is fairly unusual in that millions of new URLs get created on a daily basis, with a short lifespan.

We do the simplest thing we could imagine. Our sitemap of ~50,000,000 entries is written to static xml once a week as part of a batch job and pushed to S3. Is there any reason to believe it needs to be updated near real time? How often does Google read yours?
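
For reference, a weekly static batch like this can be quite small. The sketch below is not this commenter's code; the file names and the `write_sitemaps` function are assumptions. It splits the URL list into files of at most 50,000 entries (the per-file limit in the sitemaps.org protocol, so ~50,000,000 entries means about 1,000 files) and writes an index that lists them. The upload to S3 is left out.

```python
from itertools import islice
from pathlib import Path
from xml.sax.saxutils import escape

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000  # per-file limit in the sitemaps.org protocol
XML_HEAD = '<?xml version="1.0" encoding="UTF-8"?>\n'


def batches(urls, size):
    """Yield lists of at most `size` URLs from any iterable."""
    it = iter(urls)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def write_sitemaps(urls, out_dir, base_url, per_file=MAX_URLS_PER_FILE):
    """Write sitemap-N.xml files plus sitemap-index.xml; return the file names."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    names = []
    for n, batch in enumerate(batches(urls, per_file), start=1):
        name = f"sitemap-{n}.xml"
        # escape() handles &, < and > inside URLs
        entries = "".join(f"  <url><loc>{escape(u)}</loc></url>\n" for u in batch)
        (out / name).write_text(
            f'{XML_HEAD}<urlset xmlns="{NS}">\n{entries}</urlset>\n', encoding="utf-8"
        )
        names.append(name)
    index = "".join(
        f"  <sitemap><loc>{escape(base_url + '/' + name)}</loc></sitemap>\n"
        for name in names
    )
    (out / "sitemap-index.xml").write_text(
        f'{XML_HEAD}<sitemapindex xmlns="{NS}">\n{index}</sitemapindex>\n',
        encoding="utf-8",
    )
    return names
```

Since `urls` can be any iterable, the list can be streamed from a database cursor instead of being held in memory.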

one thing is: freshness is a factor for google ranking. so if you can communicate a fresh page to google at the exact moment when new content arrives (+ there is a chance that there is a "fresh" spike in search demand) then it's a factor.

but more important:

also, with 50 000 000 URLs, if your site gets crawled at about 500 000 pages a day (which is average) or 1M pages a day (which is good), it already takes 50 to 100 days to index your whole site. so it makes sense to communicate only the changed sitemaps (at the exact time when they changed) to google; as the sitemaps get fetched quite fast, you up your chances that the new LP gets crawled/indexed faster. whether it makes sense for you depends on how fast your page turnaround is (new pages, updated pages, deleted pages).

(p.s.: in most cases, for most businesses, a (near) real-time sitemap is overhead.)
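
The 50 to 100 day figure in this comment is just the page count divided by the daily crawl rate. A tiny helper (the name `days_to_crawl` is mine) makes it easy to plug in your own numbers; it assumes a constant crawl rate and that no page is fetched twice, which is optimistic.

```python
import math


def days_to_crawl(total_urls, pages_per_day):
    """Days for one full pass over the site at a constant crawl rate."""
    return math.ceil(total_urls / pages_per_day)


print(days_to_crawl(50_000_000, 500_000))    # average rate from the comment
print(days_to_crawl(50_000_000, 1_000_000))  # good rate from the comment
```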

Sitemaps are a pretty obvious benefit, but here's a short testimony. I run a small website, and I can testify that our sitemap (use http://www.sitemaps.org/ as reference) really helped us generate a lot of traffic. Our website got indexed a lot more heavily, which in part generated 100K unique hits a month.

On a side note, you can use www.google.com/webmasters to find out stats on how your website is being crawled by Google, and what needs to be improved.

TLDR - spend time building a good sitemap, it's worth it!

Does anyone know how you get the sublinks under your search result when your site is the first one, like when you search for code academy: http://www.google.com/search?client=safari&rls=en&q=... ???

Arhhh... very cool. Doesn't seem like there's anything you can do for google to show sitelinks for your search though?!

Google will automatically select the sitelinks for your site based on traffic to subsections of your site. It takes a bit of time, but eventually it will show up.

As well, if you don't like the sitelinks that Google picked for your site, you can remove them through www.google.com/webmasters

Ok... I'll play the waiting game then. Hate the waiting game :-/.

in case you are wondering what nice stats you can get from the combination of sitemap.xml + google webmaster tools figures:

some time ago i did a little test with a site (500K+ pages) that had indexing issues.

we sliced and diced that sitemap by a very, very simple content quality metric (think char count of the main (medium unique) content) (the higher, the less content)

this is what we got https://img.skitch.com/20120123-bm3jpjdtt4xrr2t2mxqnikmu54.p...

then we submitted the same landingpages sliced and diced by URL length (the higher, the longer the URL) https://img.skitch.com/20120123-kqnnb15bj2puy2k276jiupxg12.p...
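
One way to set up this kind of test, sketched under my own assumptions (the function name, the equal-sized slices and the `(url, char_count)` tuples are invented; the comment does not say how the slices were cut): rank the landingpages by a metric, cut the ranking into N slices, and submit each slice as its own sitemap file so that webmaster tools reports submitted vs. indexed counts per slice.

```python
def slice_by_metric(pages, n_slices, metric, reverse=False):
    """Rank pages by `metric` and cut the ranking into up to n_slices equal parts.

    Slice 1 holds the lowest-ranked pages, or the highest if reverse=True.
    """
    ranked = sorted(pages, key=metric, reverse=reverse)
    if not ranked:
        return []
    size = -(-len(ranked) // n_slices)  # ceiling division
    return [ranked[i:i + size] for i in range(0, len(ranked), size)]


# (url, char count of main content) pairs; the values are made up
pages = [
    ("https://example.com/a", 900),
    ("https://example.com/bb", 100),
    ("https://example.com/ccc", 500),
    ("https://example.com/dddd", 300),
]

# higher slice number = less content
by_content = slice_by_metric(pages, 2, metric=lambda p: p[1], reverse=True)

# higher slice number = longer URL
by_url_length = slice_by_metric(pages, 2, metric=lambda p: len(p[0]))
```

Each slice would then be written out as its own sitemap file and submitted separately.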

I thought this was pretty standard so here are some other tips for anyone starting a new site or looking to increase their SEO. Sitemaps and metatags are the main things you need on every site+page. I suggest you then write a 600-1000 word article using your keyword as the title and in the URL. Then use the same keyword 1-2% of the time in the body, so a 1000 word article has 10 to 20 mentions. Plus use the keyword in an H1, H2, and H3 tag. Also use the keyword in one link to another page on your site. You also want to include at least one off page link with anchor text of your keyword to a high-authority site like wikipedia. And then after that is done, I suggest you ping it using something like www.pingomatic.com to make it get crawled more quickly. Any other q's feel free to msg or email me.

Honestly, this is the kind of stuff I hate. The Google employee has suggested how best to use a sitemap to allow Google to index your existing content. You recommend writing an un-necessary article, packing it with keywords and links. That's the last thing the internet needs more of.

I didn't say unnecessary or even crappy content, and there is no stuffing involved. I meant to simply use the antecedent instead of a pronoun, it doesn't change the content. Google is an algorithm and it does not read "stop words" (http://en.wikipedia.org/wiki/Stop_words). And I was just suggesting that you use authority links with your anchor word, nothing unnecessary, but most posts do include links in them, so just make sure it isn't a shitty one. Content is king, everyone knows that, but I was just telling you what to do when starting out so that you have a good base for your site.

This is an SEO suggestion straight out of 2003.

All of the stuff I said is valid as of today. If this was 2003 the advice would be to keyword stuff the hell out of it. Then get thousands of shitty backlinks using comments from blogs and forum profiles. Actually that's like 2 years ago. In 2003 Google could be influenced much more easily.

That may be, but the problem is that that's a losing battle. It's the same, relatively easy strategy that everyone else is doing, and thus a losing proposition.

I really don't see how what you suggest is significantly different from keyword stuffing. You're not stuffing the page, you're stuffing the links between the pages: the blog post and your site (and the fact that you're suggesting

Search engines, including Google, have an interest in leading users to useful pages. A shill blog post stuffed with keywords and backlinks to your target content is not useful. It's not useful by definition, because you didn't add or create any useful content; instead you are working on gaming the search engine. If legitimately useful pages, that users want to see, are not appearing at the top of search results, the (legitimate) search engines will figure out a way to detect those and put them there, since that's what users want. Meanwhile, you'll be doing the same thing everyone else is doing (link stuffing and low-content search engine food blog posts), so your site becomes indistinguishable from the millions of other cheap search engine spamming sites.

Instead, build a useful product or site with useful information that's engaging to users. Of course this seems like the harder thing to do, but, duh, TANSTAAFL.

It's worth pointing out that the OP is talking about how to tell google your site exists, how to detect duplicate content, and a bunch of other stuff that if I list it here I'll be repeating it, not how to game it. Saying one should write a fluff blog post for the link juice is attempting to game it.

Some people are still stuck there.

Why not focus your efforts on developing a great product or service? If it deserves the number one spot for the top keywords in your industry, it'll get there, I guarantee it ;)

Unfortunately that is not completely true, especially in the short term. If you are relying on organic search traffic for your product or site, no matter how good it is, you may have difficulty being found, especially if the product has a common name. Aged websites (+1 year) have a large advantage vs any site less than a year or even 3 months old. A site with a high number of backlinks and a good domain (EMD) will be extremely difficult to displace. Fortunately SEO is around 40%-65% onsite these days and so if you do have a good product and service it is possible you will rise up, although it will be less likely if you do not have optimized on-page content and structure (like Sitemaps).

Of course aged websites have an advantage against the short-term thinker with a 3 month old site, and that's how it should be. The chances that an established site with a record of being regularly updated with fresh, relevant content has more useful content than a new, 3 month old site are higher (but not guaranteed). This is as it should be: those who put in more work to create useful content should win higher, more relevant placement.

Stop giving the rest of us a bad name
