I was doing it in batches, but once you go over a few million URLs I realised how much effort that takes. And in a farm of cloud web servers I had one machine doing this big batch job and then syncing to the others... effectively a master.
So I scrapped that and went to dynamic generation upon request. But the problem I found with this was that content deletion changed the URLs in the higher-numbered sitemap files... i.e. when content in the first few files gets deleted, all subsequent files shift slightly across the now-visible URLs. Because Google and others may fetch some sitemaps one day, and some the next... you risk appearing to have duplicate info in your sitemaps. I prefer long-cacheable sitemaps anyway... the URLs in file #23 should always be in #23 and not another file.
So I'm moving towards dynamic generation based on a database table that stores all possible URLs and associates batches of 20,000 URLs with each sitemap file... if I delete content referenced by sitemap #1, then that file now has 19,999 URLs and sitemap #2 remains at 20,000 URLs. A second benefit of such a table is that I can use a flag to indicate whether the content has been deleted, and use that to determine whether to return a 404 or a 410 when that URL is accessed.
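A minimal sketch of what that table and the 404/410 decision could look like (the table and column names are hypothetical, and SQLite is used just for brevity):

    import sqlite3

    conn = sqlite3.connect("sitemaps.db")

    # Hypothetical table: every URL is permanently assigned a sitemap file
    # number at creation time; deletions flip a flag instead of moving URLs.
    conn.execute("""
    CREATE TABLE IF NOT EXISTS sitemap_urls (
        url        TEXT PRIMARY KEY,
        sitemap_no INTEGER NOT NULL,  -- fixed at insert: URLs never change file
        deleted    INTEGER NOT NULL DEFAULT 0
    )
    """)

    URLS_PER_SITEMAP = 20000

    def assign_sitemap_no(conn):
        """Pick the sitemap file for a new URL: reuse the highest-numbered
        file until it holds 20,000 rows, then start a new one. Deleted rows
        still count, so existing assignments never shift."""
        row = conn.execute(
            "SELECT sitemap_no, COUNT(*) FROM sitemap_urls "
            "GROUP BY sitemap_no ORDER BY sitemap_no DESC LIMIT 1"
        ).fetchone()
        if row is None or row[1] >= URLS_PER_SITEMAP:
            return (row[0] + 1) if row else 1
        return row[0]

    def urls_for_sitemap(conn, n):
        """Live URLs to render into sitemap file #n; deleted URLs drop out,
        but the file number of every other URL stays the same."""
        return [r[0] for r in conn.execute(
            "SELECT url FROM sitemap_urls WHERE sitemap_no = ? AND deleted = 0",
            (n,))]

    def status_for(conn, url):
        """HTTP status to serve for a URL: 410 if we know it was deleted,
        404 if we never knew it, 200 otherwise."""
        row = conn.execute(
            "SELECT deleted FROM sitemap_urls WHERE url = ?", (url,)
        ).fetchone()
        if row is None:
            return 404
        return 410 if row[0] else 200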
If anyone feels that they have a better way of doing this, I'd love to know it.
Ideally, it would be non-batch generated and would strongly associate each URL with a given sitemap file.
We achieved a 97% crawl-to-landing-page ratio and an overall 90%+ indexing ratio (after a lot of trial and error with the quality metric) on sites with 35M+ pages.
That said, for sites with fewer than 10M pages I don't bother anymore: just submit big, complete sitemaps and update them when a sitemap gets added (or a big bunch of them deleted). The overhead of running and maintaining a real-time sitemap for small websites (fewer than 10M pages) is just too much.
That's where I explain how we did this at eBay when I was working there.
eBay is quite unique in that millions of new URLs get created on a daily basis, each with a short lifespan.
But more importantly:
Also, with 50,000,000 URLs, as your site gets crawled at about 500,000 pages a day (which is average) or 1M pages a day (which is good), it already takes 50 to 100 days to index your whole site. So it makes sense to communicate only the changed sitemaps to Google, at the exact time they change: as the sitemaps get fetched quite fast, you up your chances that the new landing pages get crawled/indexed faster. Whether it makes sense for you depends on how fast your page turnaround is (new pages, updated pages, deleted pages).
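For illustration, a minimal sketch of pinging Google about one changed sitemap file via the google.com/ping endpoint (the sitemap URL below is a made-up example; note that Google has since retired this endpoint, in 2023):

    import urllib.parse
    import urllib.request

    def ping_google(sitemap_url):
        """Tell Google that one specific sitemap file changed, instead of
        resubmitting all of them."""
        ping = ("https://www.google.com/ping?sitemap="
                + urllib.parse.quote(sitemap_url, safe=""))
        with urllib.request.urlopen(ping) as resp:
            return resp.status  # 200 means the ping was accepted

    # Hypothetical example: only sitemap #23 changed, so only #23 is pinged.
    ping_google("https://example.com/sitemaps/sitemap-23.xml")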
(P.S.: in most cases, for most businesses, a (near) real-time sitemap is just overhead.)
On a side note, you can use www.google.com/webmasters to find out stats on how your website is being crawled by Google, and what needs to be improved.
TLDR - spend time building a good sitemap; it's worth it!
As well, if you don't like the sitelinks that Google picked for your site, you can remove them through www.google.com/webmasters
Some time ago I did a little test with a site (500K+ pages) that had indexing issues.
We sliced and diced that sitemap by a very, very simple content quality metric (think char count of the main, medium-unique content); the higher the slice number, the less content.
This is what we got: https://img.skitch.com/20120123-bm3jpjdtt4xrr2t2mxqnikmu54.p...
Then we submitted the same landing pages sliced and diced by URL length (the higher the slice number, the longer the URL): https://img.skitch.com/20120123-kqnnb15bj2puy2k276jiupxg12.p...
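For reference, a rough sketch of that kind of slicing: bucket pages by a crude char-count metric and write one sitemap per bucket, so indexing rates can be compared per slice in webmaster tools (the thresholds and filenames here are made up for illustration):

    def slice_number(main_content, buckets=(5000, 2000, 1000, 500)):
        """Return a 1-based slice number; higher number = less content."""
        chars = len(main_content)
        for i, threshold in enumerate(buckets):
            if chars >= threshold:
                return i + 1
        return len(buckets) + 1

    def write_slices(pages):
        """pages: iterable of (url, main_content) pairs; writes one
        sitemap file per quality slice."""
        slices = {}
        for url, content in pages:
            slices.setdefault(slice_number(content), []).append(url)
        for n, urls in sorted(slices.items()):
            with open(f"sitemap-quality-{n}.xml", "w") as f:
                f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
                f.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
                for url in urls:
                    f.write(f"  <url><loc>{url}</loc></url>\n")
                f.write("</urlset>\n")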
I really don't see how what you suggest is significantly different from keyword stuffing. You're not stuffing the page; you're stuffing the links between the pages: the blog post and your site (and the fact that you're suggesting
Search engines, including Google, have an interest in leading users to useful pages. A shill blog post stuffed with keywords and backlinks to your target content is not useful. It's not useful by definition, because you didn't add or create any useful content; instead you are working on gaming the search engine. If legitimately useful pages, the ones users want to see, are not appearing at the top of search results, the (legitimate) search engines will figure out a way to detect them and put them there, since that's what users want. Meanwhile, you'll be doing the same thing everyone else is doing (link stuffing and low-content search-engine-food blog posts), so your site becomes indistinguishable from the millions of other cheap search-engine-spamming sites.
Instead, build a useful product or site with useful information that's engaging to users. Of course this seems like the harder thing to do, but, duh, TANSTAAFL.
It's worth pointing out that the OP is talking about how to tell Google your site exists, how to detect duplicate content, and a bunch of other stuff I won't repeat here, not how to game it. Saying one should write a fluff blog post for the link juice is attempting to game it.