

Google employee explains how they use sitemaps and how we can benefit from them - stymiee
http://webmasters.stackexchange.com/q/30186/1253

======
buro9
I'd love to see how people approach generating their sitemaps.

I was generating them in batches, but once you go over a few million URLs I
realised how much effort that takes. And in a farm of cloud web servers I had
one machine doing this big batch job and then syncing the result to the
others... effectively a master.

So I scrapped that and went to dynamic generation upon request. But the
problem I found with this was that content deletion changed the URLs in the
higher-numbered sitemap files... i.e. when content in the first few files gets
deleted, all subsequent files shift slightly across the remaining URLs.
Because Google and others may only fetch some sitemaps one day, and some the
next... you risk appearing to have duplicate info in your sitemaps. I prefer
long-cacheable sitemaps anyway... the URLs in file #23 should always be in #23
and not another file.

So I'm moving towards dynamic generation based on a database table that stores
all possible URLs and associates batches of 20,000 URLs with each sitemap
file... if I delete content referenced by sitemap #1, then that file now has
19,999 URLs and sitemap #2 remains at 20,000. A second benefit of such a table
is that I can use a flag to indicate whether the content has been deleted, and
use that to decide whether to return a 404 or a 410 when that URL is accessed.
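
Roughly, a minimal sketch of that scheme (SQLite and all the names here are
made up, just for illustration):

    import sqlite3
    from xml.sax.saxutils import escape

    URLS_PER_FILE = 20_000  # cap per sitemap file

    db = sqlite3.connect("sitemaps.db")
    db.execute("""CREATE TABLE IF NOT EXISTS sitemap_urls (
        url     TEXT PRIMARY KEY,
        file_no INTEGER NOT NULL,   -- the sitemap file this URL lives in, forever
        deleted INTEGER NOT NULL DEFAULT 0
    )""")

    def add_url(url):
        # New URLs always go into the highest-numbered file; once a file has
        # held 20,000 rows (live or deleted) it is full and we open the next.
        # Deletions elsewhere never cause URLs to migrate between files.
        file_no = db.execute(
            "SELECT COALESCE(MAX(file_no), 1) FROM sitemap_urls").fetchone()[0]
        count = db.execute(
            "SELECT COUNT(*) FROM sitemap_urls WHERE file_no = ?",
            (file_no,)).fetchone()[0]
        if count >= URLS_PER_FILE:
            file_no += 1
        db.execute("INSERT OR IGNORE INTO sitemap_urls (url, file_no) VALUES (?, ?)",
                   (url, file_no))
        db.commit()

    def mark_deleted(url):
        # Flip a flag instead of deleting the row, so file numbering stays
        # stable and the request handler can answer 410 rather than 404.
        db.execute("UPDATE sitemap_urls SET deleted = 1 WHERE url = ?", (url,))
        db.commit()

    def sitemap_xml(file_no):
        # Render file #N on demand; deleted URLs drop out, the rest stay put.
        rows = db.execute(
            "SELECT url FROM sitemap_urls WHERE file_no = ? AND deleted = 0",
            (file_no,))
        entries = "".join("<url><loc>%s</loc></url>" % escape(u) for (u,) in rows)
        return ('<?xml version="1.0" encoding="UTF-8"?>'
                '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
                + entries + '</urlset>')

    def status_for(url):
        # 410 Gone for content we know we removed, 404 for URLs we never had.
        row = db.execute("SELECT deleted FROM sitemap_urls WHERE url = ?",
                         (url,)).fetchone()
        return 404 if row is None else (410 if row[0] else 200)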

If anyone feels that they have a better way of doing this, I'd love to know
it.

Ideally, it would be non-batch generated, and strongly associate a URL to a
given sitemap file.

~~~
franze
i have done a shitload of different sitemap.xml logic over time; the most
advanced (which is still in use by a top-500 worldwide website) is a real-time
sitemap. if a new dataset gets added, we calculate whether the dataset will
result in new landing pages or will lead to an update of an existing landing
page. if the new/updated landing pages are within a content quality range, the
"shelving logic" looks up which shelf (i.e. which sitemap.xml) they belong to.
it then sets the various last-modified dates (which bubble up from the landing
pages to the sitemap.xml to the sitemap-index to the robots.txt)
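
a rough sketch of that bubbling (all the names and the quality check are made
up; robots.txt just carries a "Sitemap:" line pointing at the index):

    from datetime import datetime, timezone

    # a "shelf" is one sitemap.xml; the sitemap-index lists all shelves.
    # lastmod bubbles page -> shelf -> index, and robots.txt points crawlers
    # at the index, so they only re-fetch shelves whose lastmod moved.
    shelves = {}  # shelf name -> {"lastmod": datetime, "pages": {url: lastmod}}

    def quality_ok(page):
        # stand-in for the content quality range check
        return page.get("quality", 0) >= 0.5

    def shelf_for(page):
        # stand-in shelving logic: pick a shelf by some stable property
        return "sitemap-%s.xml" % page["category"]

    def on_dataset_added(page):
        if not quality_ok(page):
            return  # low-quality landing pages never enter a sitemap
        now = datetime.now(timezone.utc)
        shelf = shelves.setdefault(shelf_for(page), {"lastmod": now, "pages": {}})
        shelf["pages"][page["url"]] = now  # page-level lastmod
        shelf["lastmod"] = now             # bubbles up to the shelf

    def sitemap_index():
        # each index entry carries the bubbled lastmod of its shelf
        entries = "".join(
            "<sitemap><loc>https://example.com/%s</loc><lastmod>%s</lastmod></sitemap>"
            % (name, shelf["lastmod"].strftime("%Y-%m-%dT%H:%M:%SZ"))
            for name, shelf in shelves.items())
        return ('<?xml version="1.0" encoding="UTF-8"?>'
                '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
                + entries + '</sitemapindex>')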

we achieved a 97% crawl-to-landing-page ratio, and an overall 90%+ indexing
ratio (after a lot of trial and error with the quality metric) on sites with
35M+ pages.

that said, for sites with fewer than 10M pages i do not care anymore: just
submit big complete sitemaps, and update them when a dataset gets added (or a
big bunch of pages gets deleted). the overhead of running and maintaining a
real-time sitemap for small websites (fewer than 10M pages) is just too much.

~~~
thenextcorner
You might want to read my answer to this question on Quora:
[http://www.quora.com/If-I-have-a-website-with-millions-of-
un...](http://www.quora.com/If-I-have-a-website-with-millions-of-unique-pages-
should-I-submit-a-partial-sitemap-to-Google)

Where I explain how we did this at eBay when I was working there. eBay is
quite unique in that millions of new URLs get created on a daily basis, with a
short lifespan.

------
hybrid11
Sitemaps are a pretty obvious benefit, but here's a short testimony. I run a
small website, and I can testify that our sitemap (we used
<http://www.sitemaps.org/> as a reference) really helped us generate a lot of
traffic. Our website got indexed a lot more heavily, which in part generated
100K unique hits a month.

On a side note, you can use www.google.com/webmasters to find stats on how
your website is being crawled by Google, and what needs to be improved.

TLDR - spend time building a good sitemap, it's worth it!
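
(For reference, a minimal sitemaps.org-style file takes only a few lines to
generate; the URLs and dates below are placeholders:)

    from xml.sax.saxutils import escape

    # a minimal sitemaps.org sitemap; pages are placeholders
    pages = [
        ("https://example.com/", "2012-07-01"),
        ("https://example.com/about", "2012-06-15"),
    ]

    entries = "".join(
        "<url><loc>%s</loc><lastmod>%s</lastmod></url>" % (escape(url), lastmod)
        for url, lastmod in pages)

    with open("sitemap.xml", "w") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>'
                '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
                + entries + '</urlset>')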

------
holgersindbaek
Does anyone know how you get the sublinks under your result, when your site is
the first one, like when you search for code academy:
[http://www.google.com/search?client=safari&rls=en&q=...](http://www.google.com/search?client=safari&rls=en&q=code+academy&ie=UTF-8&oe=UTF-8)
?

~~~
staunch
These are called "Sitelinks"
[http://support.google.com/webmasters/bin/answer.py?hl=en&...](http://support.google.com/webmasters/bin/answer.py?hl=en&answer=47334)

~~~
holgersindbaek
Arhhh... very cool. Doesn't seem like there's anything you can do to get
Google to show sitelinks for your site, though?!

~~~
hybrid11
Google will automatically select the sitelinks for your site based on traffic
to subsections of your site. It takes a bit of time, but eventually it will
show up.

As well, if you don't like the sitelinks that Google picked for your site, you
can remove them through www.google.com/webmasters

~~~
holgersindbaek
Ok... I'll play the waiting game then. Hate the waiting game :-/.

------
franze
in case you are wondering what nice stats you can get from combining
sitemap.xml with google webmaster tools figures, here is an example.

some time ago i did a little test with a site (500K+ pages) that had indexing
issues.

we sliced and diced that sitemap by a very, very simple content quality metric
(think char count of the main, mostly unique, content; the higher the bucket
number, the less content).
this is what we got
[https://img.skitch.com/20120123-bm3jpjdtt4xrr2t2mxqnikmu54.p...](https://img.skitch.com/20120123-bm3jpjdtt4xrr2t2mxqnikmu54.png)

then we submitted the same landing pages sliced and diced by URL length (the
higher the bucket number, the longer the URL):
[https://img.skitch.com/20120123-kqnnb15bj2puy2k276jiupxg12.p...](https://img.skitch.com/20120123-kqnnb15bj2puy2k276jiupxg12.png)
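
the slicing itself is trivial; a sketch with made-up bucket boundaries:

    # slice landing pages into per-bucket sitemaps by a crude quality
    # metric: char count of the main content. boundaries are made up.
    def bucket_for(char_count):
        if char_count >= 2000:
            return "sitemap-content-high.xml"
        if char_count >= 500:
            return "sitemap-content-mid.xml"
        return "sitemap-content-low.xml"

    pages = [
        ("https://example.com/a", 3500),
        ("https://example.com/b", 800),
        ("https://example.com/c", 120),
    ]

    buckets = {}
    for url, chars in pages:
        buckets.setdefault(bucket_for(chars), []).append(url)

    # submit each bucket as its own sitemap; webmaster tools then reports
    # submitted vs. indexed per bucket, i.e. indexing ratio vs. the metric
    for name, urls in sorted(buckets.items()):
        print(name, len(urls))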

------
matznerd
I thought this was pretty standard, so here are some other tips for anyone
starting a new site or looking to improve their SEO. Sitemaps and meta tags
are the main things you need on every site and page. I suggest you then write
a 600-1000 word article using your keyword as the title and in the URL. Then
use the same keyword 1-2% of the time in the body, so a 1000 word article has
10-20 mentions. Also use the keyword in an H1, H2, and H3 tag, and in one link
to another page on your site. You also want to include at least one off-page
link with your keyword as the anchor text, pointing to a high-authority site
like Wikipedia. And once that is done, I suggest you ping it using something
like www.pingomatic.com so it gets crawled more quickly. Any other q's, feel
free to msg or email me.
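
If you want to sanity-check the density, a quick sketch (assumes a
single-word keyword):

    import re

    def keyword_density(text, keyword):
        # fraction of words that are the keyword; 0.01-0.02 is the 1-2% target
        words = re.findall(r"[a-z']+", text.lower())
        hits = sum(1 for w in words if w == keyword.lower())
        return hits / len(words) if words else 0.0

    # e.g. a 1000-word article with 10 mentions -> 0.01 (1%)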

~~~
thwarted
This is an SEO suggestion straight out of 2003.

~~~
matznerd
All of the stuff I said is valid as of today. If this were 2003, the advice
would be to keyword-stuff the hell out of it, then get thousands of shitty
backlinks via blog comments and forum profiles. Actually, that's more like 2
years ago. In 2003 Google could be influenced much more easily.

~~~
thwarted
That may be, but the problem is that it's a losing battle. It's the same
relatively easy strategy that everyone else is using, and thus a losing
proposition.

I really don't see how what you suggest is significantly different from
keyword stuffing. You're not stuffing the page, you're stuffing the links
between the pages: the blog post and your site (and the fact that you're
suggesting

Search engines, including Google, have an interest in leading users to
_useful_ pages. A shill blog post stuffed with keywords and backlinks to your
target content is not useful. It's not useful by definition, because you
didn't add or create any useful content; instead you are working on gaming the
search engine. If legitimately useful pages that users want to see _are not_
appearing at the top of search results, the (legitimate) search engines will
figure out a way to detect them and put them there, since that's what users
want. Meanwhile, you'll be doing the same thing everyone else is doing (link
stuffing and low-content, search-engine-food blog posts), so your site becomes
indistinguishable from the millions of other cheap search-engine-spamming
sites.

Instead, build a useful product or site with useful information that's
engaging to users. Of course this seems like the harder thing to do, but, duh,
TANSTAAFL.

It's worth pointing out that the OP is talking about how to tell Google your
site exists, how to detect duplicate content, and a bunch of other stuff
(which I won't repeat here), not how to game it. Saying one should write a
fluff blog post for the link juice is attempting to game it.

