

Ask HN: How do huge sites with millions of URLs manage their sitemaps? - techaddict009

Sites like Twitter, Facebook, Quora, etc. probably have sitemaps so that search engines can crawl them more effectively.

How do they manage such a huge sitemap?

What kind of special tools do they use for it?
======
PaulHoule
I have a few sites with a few million URIs and found that a sitemap leads to
much faster and better indexing than I would have had otherwise.

The sitemap protocol is a little awkward at this scale, since a single sitemap
file can hold only 50k links, so you need a sitemap index. If you are not
exceptionally careful, you will see files updated while the crawlers are
downloading them, sitemap indexes not perfectly synced to the sitemaps, and
other things that cause transient errors. In theory you could be careful with
how things are date-stamped to reduce the considerable load of serving
your sitemaps, but web crawlers don't 100% trust assertions about how often
things get updated.
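
The chunking described above can be sketched in a few lines. This is a minimal, hypothetical example (the domain and file names are made up, not from any real site) that splits a URL list into sitemap files of at most 50,000 entries and builds the index that references them:

```python
# Minimal sketch: chunk a large URL set into sitemap files of <= 50,000
# entries (the protocol limit) and emit a sitemap index pointing at each
# chunk. The example.com URLs and sitemap-N.xml names are hypothetical.
from xml.sax.saxutils import escape

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000

def build_sitemap(urls):
    """Render one sitemap file (at most 50k URLs) as an XML string."""
    entries = "\n".join(f"  <url><loc>{escape(u)}</loc></url>" for u in urls)
    return (f'<?xml version="1.0" encoding="UTF-8"?>\n'
            f'<urlset xmlns="{SITEMAP_NS}">\n{entries}\n</urlset>')

def build_index(sitemap_urls):
    """Render the sitemap index that references each chunk file."""
    entries = "\n".join(f"  <sitemap><loc>{escape(u)}</loc></sitemap>"
                        for u in sitemap_urls)
    return (f'<?xml version="1.0" encoding="UTF-8"?>\n'
            f'<sitemapindex xmlns="{SITEMAP_NS}">\n{entries}\n</sitemapindex>')

def chunk(seq, size=MAX_URLS_PER_FILE):
    """Yield successive slices of at most `size` items."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

# Example: 120k URLs -> 3 sitemap files plus 1 index.
all_urls = [f"https://example.com/page/{i}" for i in range(120_000)]
sitemaps = [build_sitemap(part) for part in chunk(all_urls)]
index = build_index(f"https://example.com/sitemap-{n}.xml"
                    for n in range(len(sitemaps)))
```

At a million pages that's only ~20 sitemap files, which is manageable; at a billion it's 20,000 files, which is where the bookkeeping starts to hurt.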

All told, sitemaps work OK for a million pages, but probably not for a
billion.
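
The "files updated while crawlers are downloading them" problem can be reduced by publishing each regenerated file atomically. A hedged sketch, assuming a POSIX filesystem where `os.replace` is an atomic rename (paths here are hypothetical):

```python
# Sketch: write a regenerated sitemap to a temp file in the same
# directory, then atomically swap it into place with os.replace, so a
# crawler mid-download sees either the old file or the new one, never a
# partially written file. Destination paths are hypothetical.
import os
import tempfile

def publish(content: str, dest: str) -> None:
    """Atomically replace `dest` with `content`."""
    # Temp file must live on the same filesystem for the rename to be atomic.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(dest) or ".")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(content)
        os.replace(tmp, dest)  # atomic swap on POSIX
    except Exception:
        os.remove(tmp)  # clean up the temp file if anything failed
        raise
```

This only addresses single-file consistency; keeping the index in sync with its sitemaps still needs ordering (publish the chunk files first, the index last).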

------
someguy1233
I can't seem to find any evidence of a sitemap on Twitter, Facebook or Quora.

Quora has THIS: [http://www.quora.com/sitemap](http://www.quora.com/sitemap)

I can't find any sort of XML, JSON, or even TXT sitemap on any major site like
that.

They may have a special agreement with the search engines, or maybe their
sitemaps are only served to search-bot user agents/IPs.

The other alternative is that they simply DON'T HAVE A SITEMAP. Search
bots might just be crawling them normally, or there may be some sort of
link-feed agreement with the companies that lets them index as soon as a new
link is created internally.

