My intuitive answer to that problem was to say that if we assume the captured se...

lostcolony · on June 18, 2021

If there are 100 tanks, and you get 1, 2, 5, and 99, your method would give about 54 tanks ((1 + 2 + 5 + 99)/4 * 2 = 53.5), which is obviously wrong, since you've already seen serial number 99.

Your error is in stating "if we assume the captured serial numbers are randomly distributed" - you're assuming they're -uniformly- distributed. Randomly distributed != uniformly distributed.

Their method would give you roughly 123 as a guess (99 + 99/4 - 1, if I'm reading the formula right). It's including the known info (i.e., adding "m") to take into account the fact that they're not necessarily evenly distributed.

On that note, if you continued to get tanks at low numbers (3, 4, 6, etc.), averaging gets -less- accurate, because that 99 becomes more and more of an outlier. Their method gets MORE accurate, again, because they're taking advantage of all the data that is known (we know it goes at least to 99), and averaging doesn't. The new low numbers we've added make it less likely that there are many more tanks beyond 99, and the formula in the link takes that into account with m/k.

Both methods will be accurate if you have 100% of the data, but taking twice the average ignores known data, so the sparser the data the less likely it is to be correct.
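
To make the comparison concrete, here's a quick sketch of both estimators in Python. I'm assuming the linked formula is the usual m + m/k - 1 (m = largest serial seen, k = number of tanks captured); the function names are just mine for illustration.

```python
def twice_mean_estimate(serials):
    """Naive guess: double the average of the captured serial numbers."""
    return 2 * sum(serials) / len(serials)


def max_plus_gap_estimate(serials):
    """Guess m + m/k - 1 (assumed form of the linked formula).

    m is the largest serial seen, k is how many tanks were captured.
    """
    m, k = max(serials), len(serials)
    return m + m / k - 1


sample = [1, 2, 5, 99]
more_low = sample + [3, 4, 6]  # the extra low serials from the example above

for serials in (sample, more_low):
    print(
        serials,
        "twice the mean:", round(twice_mean_estimate(serials), 2),
        "m + m/k - 1:", round(max_plus_gap_estimate(serials), 2),
    )
```

With the extra low serials, twice the mean drifts further below 99 (a number we've already seen), while the m + m/k - 1 guess can never drop below the largest serial and just tightens toward it.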

saalweachter · on June 18, 2021

Hmmm, on the other hand, suppose you first find a tank with the serial number 1234.

Then the next 50 tanks you find are all from the range [1, 100].

Is it more reasonable to assume that there are around 1258 tanks, or that there are probably closer to 100 tanks, and that first one with the very large serial number was not a sequentially numbered tank?

lostcolony · on June 18, 2021

Certainly!

But, from the article's initial proposition - "You do know that the Germans have a sequential numbering system (1, 2, …, n)" and in giving historical context "On investigation, it became clear that the serial numbers were sequential, without gaps."

So, yes, without that being a prior, of course it's more likely that that outlier is a strange one-off, and you'd do better to exclude it from your data set (and/or continue to investigate, because it's NOT at all clear that the serial numbers are sequential yet).

But that context and ordering matter. Assume just the opposite series of events - you started by finding 50 tanks with serial numbers in [1, 100]. Then three or four months go by without any tank serials being sent to you. And then you get 1234. 1258 tanks seems really reasonable at that point (and, in fact, would fit the reality; the Germans were producing ~256 tanks per month per the article).
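
For what it's worth, here's the arithmetic as a rough Python sketch. Assumptions on my part: the formula is m + m/k - 1, the largest serial among the 50 low tanks is exactly 100, and production is a flat 256 per month (the article's figure). The names are made up for illustration.

```python
def estimate(m, k):
    """Assumed form of the linked formula: largest serial m, sample size k."""
    return m + m / k - 1


# 50 tanks in [1, 100] plus the one with serial 1234 -> k = 51, m = 1234
with_outlier = estimate(1234, 51)

# Only the 50 low tanks, assuming the largest of them is exactly 100
without_outlier = estimate(100, 50)

# Months of production needed to get from serial 100 to serial 1234
# at an assumed flat rate of 256 tanks per month
months = (1234 - 100) / 256

print("with 1234 included:", round(with_outlier, 1))
print("with 1234 excluded:", round(without_outlier, 1))
print("months from serial 100 to 1234:", round(months, 1))
```

The formula itself can't tell you which story is right; it only depends on m and k, so the ordering and the gap in time are extra context you have to bring yourself.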