These are good questions! There are a lot of subtleties here. One of them is that for real one-way functions you generally get a big number and you need some sort of reduction back to the password domain. Let's call it '¬' because most fonts have that character these days. Chaining actually looks like
password → number ¬ new_password → new_number ¬ ...
Second, both the one-way function '→' and the reduction '¬' are not one-to-one mappings. This process will give you a preimage—i.e. given H it can give some password P such that P → H and you are happy—but that password will not necessarily be the original password! Usually any preimage will do, thankfully, but it's still a problem because this means that two different chains can 'merge' into one. This ties very closely to the question you're asking.
As you have surmised, 100% coverage of the input domain is also really hard to achieve because you can't control which words the one-way function (and therefore the reduction) is going to choose! Many rainbow tables just settle for some inaccuracy -- so if you cover 99.9% of the input domain, you don't worry about trying to get all of the other 0.1%.
The reason these two completely different-sounding things are connected is that if you think about it, if you find one of those 0.1% and feed it to your hash function and then the reduction -- there's actually a 99.9% chance that you end up on a chain you already have! So you have these super-short chains which immediately merge into other chains!
If you were trying to compress by a factor of 1000, then you might choose a chain length of 2000, try for 99.95% coverage, and then store these missing 0.05% as a separate lookup table -- then between the two tables you have 100% coverage. Or, you might choose a chain length of 1000 with, say, a 99% coverage, and just accept a 1% failure rate on hashes. Either one is acceptable depending on how you store the chains and how much work you want to do to ensure coverage.
It may not always be possible to get perfect compression. Suppose for some reason you have the property that the hash for 10% of the words is "be" and the hash for a further 50% of the words is one of the words that hashes to "be." Using rainbow tables for this is not going to go well because all your chains are going to end up at "be" inside of a small number of iterations. But a hash with this property will also fail at its intended purpose because Alice could claim the hash for a word she doesn't know is 'be' and have a 10% chance of being right.
Good hash functions are specifically designed to make hash collisions like that as rare as possible, so in practice the compression ratio will be strongly correlated to the chain length, but you can still have instances where one hash value is slightly more common than another. The result is just that if you use a chain length of 100, the compression ratio in practice will be slightly worse than 100:1.
As to how to choose the heads of the chains, I suspect the answer for getting the absolute optimal choices will either be NP complete or will involve some application of linear algebra. If all else fails you could just keep adding random chains rooted in a value that isn't in any existing chain until you have full coverage. I don't know what people do in practice.