
German tank problem - fortepianissimo
http://en.wikipedia.org/wiki/German_tank_problem
======
T-hawk
This actually hit my previous company in a software context.

We would number our hotfixes sequentially. Many would be items demanded by a
single client, so would get deployed as hotfixes only to that customer's site,
and just rolled into the main trunk for the next quarterly release for
everyone else. Clients would always be notified about hotfixes going onto
their live sites.

One savvy client noticed the hotfix numbering sequence. Naturally, that led to
quite a number of extremely awkward discussions, as they would regularly ask
why our software needed so many hotfixes (tens per week) and why they weren't
entitled to all of them right away.

Solution: a new policy to randomly generate hotfix numbers. Which of course
led to the next problem, that now the sequence was not obvious from the names,
so dependent hotfixes would sometimes get deployed in the wrong order. Why
can't anything be easy...

~~~
vacri
Just name the hotfixes by day (140222). It monotonically increases, and if you
do multiple hotfixes in one day, suffix a/b/c etc. Generally you're unlikely
to get up to b or c, and there's no clue to how many previous versions there
have been.

~~~
hcarvalhoalves
Drop the a,b,c... just timestamp it.

~~~
vacri
Leads to longer timestamps though, when you include the HHMM. If you're
generally not doing more than one per day, this makes it a little easier -
usually six digits instead of ten.

~~~
mseebach
If only we had access to some kind of machine that made dealing with long
numbers easy.

~~~
vacri
Except that a hotfix number is a human-interaction number, not an automation
device. Speaking as someone who has worked on the support phones, I'd much,
much, much rather someone only have to read back a six-digit number than a
ten-digit one.

------
sparkman55
There is some practical relevance to software development here. One shouldn't
expose sequential IDs (a.k.a. serial numbers) to the public for anything
non-public.

I see this Hacker News post has a numerical ID in the URL, for example; I can
estimate the size of Hacker News given enough of these numbers... More
directly, I can modify that numerical ID to crawl Hacker News.

Many sites do this; it's generally better to generate a 'slug' (random,
hashed, or derived from a natural key) to use as the key instead. For example,
Amazon generates a unique, non-sequential, 10-character alphanumeric string
for each item in its catalog.
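For illustration, a minimal Python sketch of generating such a random slug
(the alphabet and 10-character length are assumptions, loosely modeled on
Amazon-style IDs):

```python
import secrets
import string

# Assumed alphabet: uppercase letters and digits, Amazon-ASIN style.
ALPHABET = string.ascii_uppercase + string.digits

def make_slug(length: int = 10) -> str:
    """Return a cryptographically random, non-sequential public identifier."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))

print(make_slug())  # reveals nothing about catalog size or insertion order
```

Using `secrets` rather than `random` matters here: the point is that an
outsider cannot predict or enumerate neighboring IDs.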

~~~
pavel_lishin
The flipside is that you can give off the impression of having a large user
base/product catalog/etc if you number things sequentially... but start at a
large non-round number.

~~~
sparkman55
It seems like one could use the same technique to estimate the initial
(lowest-observable) serial number...

From the article:

    
    
      If starting with an initial gap between 0 and the lowest
      sample (sample minimum), the average gap between samples is
      (m - k)/k; the -k being because the samples themselves are
      not counted in computing the gap between samples.
    

Perhaps someone with a better grasp on the math can confirm that this makes
'obfuscating size by starting with a higher serial number' an ineffective
mechanism?

~~~
gweinberg
Yes. If you're only looking at the gaps between the numbers, adding a constant
offset to the serial numbers would have no effect on the estimate.

On the other hand, if instead of ordering them sequentially I roll a die and
add the number of spots to the previous serial number, I think I can trick you
into thinking I have three times as many tanks as I actually do. In fact, I
feel quite confident of it.
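A quick sketch of why a constant offset washes out, using a gap-style
estimator that depends only on the spread max - min (the constants follow
from E[max - min] = (N+1)(k-1)/(k+1) for k serials sampled uniformly from
1..N; the fleet size and offset below are arbitrary choices):

```python
import random

def size_from_spread(serials):
    # Uses only max - min, so adding any constant offset to every serial
    # leaves the estimate unchanged.
    k = len(serials)
    return (max(serials) - min(serials)) * (k + 1) / (k - 1) - 1

random.seed(7)
sample = random.sample(range(1, 301), 10)     # true fleet size N = 300
shifted = [s + 5000 for s in sample]          # same fleet, offset serials
print(size_from_spread(sample))               # estimate of N (true value 300)
print(size_from_spread(sample) == size_from_spread(shifted))  # True
```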

~~~
Someone
If I find sufficiently many of your tanks, the distribution of the differences
in serial numbers would start showing that we aren't talking about a random
sample from 1…n.

For example, having seen 250 IDs in the 1…1000 range and 200 in the 1001…2000
range, the next ID in the 1…2000 range I see should fall in the 1…1000 range
with probability 750/(750 + 800) ~= 0.48 in the 'normal' case, and around
36/(36 + 86) ~= 0.30 with your method of doling out IDs.

And I think the factor would be 3.5 (the expected number of pips on a die
roll), not 3. That's why I expect your method to dole out 286 out of every
1000 IDs.

But it would require me to check for this, depend on finding such
discrepancies in samples, and increase the variance of my estimates for a
given sample size.
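A small simulation of the die-roll numbering scheme (assuming a fair
six-sided die, so the mean gap is 3.5):

```python
import random

random.seed(1)
serial, issued = 0, 0
while True:
    serial += random.randint(1, 6)  # each gap is one die roll, mean 3.5
    if serial > 1000:
        break
    issued += 1
print(issued)  # about 1000 / 3.5, i.e. roughly 286 IDs per 1000 serials
```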

------
jxf
It's astounding how accurate they were using only statistical methods:

> Analysis of wheels from two tanks (48 wheels each, 96 wheels total) yielded
> an estimate of _270 produced in February 1944_, substantially more than had
> previously been suspected.

> German records after the war showed production for the month of February
> 1944 was _276_.

~~~
ta53535
It wasn't exclusively statistical:

    
    
        The analysis of tank wheels yielded an estimate for the
        number of wheel molds that were in use. A discussion with
        British road wheel makers then estimated the number of
        wheels that could be produced from this many molds...

~~~
Someone
You can't do statistics without data. You weren't complaining that they used
wheel data, either, were you?

------
schoen
Huh, I once visited a military base where people on the trip wanted to be
photographed with a tank. The soldiers said it was OK, as long as somebody
obscured the tank's serial number by standing in front of it. I wonder if
their training in this respect was inspired by this history!

(But if so, why not print the serial numbers inside the tank, not outside? Or
maybe encrypt or HMAC them?)

~~~
patmcguire
Apparently encryption is weak here because the plaintext has a very simple
pattern (increasing numbers).

[http://en.wikipedia.org/wiki/German_tank_problem#Countermeas...](http://en.wikipedia.org/wiki/German_tank_problem#Countermeasures)
[http://en.wikipedia.org/wiki/Known-plaintext_attack](http://en.wikipedia.org/wiki/Known-plaintext_attack)

I don't understand HMAC well enough to know whether it solves this, but there
seems to be a trade-off between keeping it secure by introducing randomness
and making people do lookups on the other end (which would have been super
slow then, and I don't know how feasible hash tables are when dealing with
punch cards).

~~~
sqrt2
A MAC of a message m can only be computed with the knowledge of a key K.
Specifically, with a cryptographic hash function h,

    
    
      HMAC(K, m) = h((K + a) || h((K + b) || m)),
    

where + is addition mod 2 (xor), || is concatenation and a and b are
constants. (This construction takes into account possible length extension
attacks on h.)

Given that h is secure, knowledge of any reasonable number of pairs (m,
HMAC(K, m)) does not allow you to recover K, and without K, you cannot compute
HMAC(K, m) for known m, i.e. enumerate all the possible MACs for serial
numbers.
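In Python this is a one-liner with the standard library; the key and the
truncation length below are illustrative assumptions:

```python
import hashlib
import hmac

KEY = b"factory-secret"  # hypothetical key, known only to the manufacturer

def tagged_serial(n: int) -> str:
    """Derive a public tag from a sequential serial with HMAC-SHA256.
    Without KEY, observed tags reveal neither order nor count."""
    tag = hmac.new(KEY, str(n).encode(), hashlib.sha256).hexdigest()
    return tag[:12]  # truncated (an assumption) to fit on a stencil

print(tagged_serial(1))
print(tagged_serial(2))  # looks unrelated to the previous tag
```

The factory can still recover ordering by keeping a lookup table from serial
to tag, which is exactly the lookup cost the parent comment worries about.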

------
IgorPartola
I don't remember where I read it, at least 12 years ago, but someone described
an April Fools' prank where they released three pigs in their high school,
with the numbers 1, 2, and 4 written on them. Allegedly the administrators
spent weeks looking for number 3.

~~~
lotharbot
The "weeks looking for number 3" part sounds apocryphal to me.

Snopes [0] has one real example and one television example of this prank. In
the real example, the students were caught on camera, and there was no long
search for the remaining livestock.

[0]
[http://www.snopes.com/college/pranks/livestock.asp](http://www.snopes.com/college/pranks/livestock.asp)

~~~
IgorPartola
Thanks for actually researching this. As I remember, I read this on bash.org,
so I was pretty skeptical of the story just because of the source. It's one of
those stories I'd like to believe is true just because I like it.

------
sbirch
My favorite explanation of this (posed instead as the locomotive problem) is
in Allen Downey's "Think Bayes," p. 22.

It's online too, and worth reading!

[http://www.greenteapress.com/thinkbayes/](http://www.greenteapress.com/thinkbayes/)

~~~
jmount
The book seems nice, but its discussion of the German tank problem sets up
code to calculate posteriors from priors that are more detailed than the
Bayesian argument in the Wikipedia article. That is useful, but you don't get
the key intuition of the problem: that the answer is the maximum ID you saw,
inflated by a factor depending on the expected ID gap. There are some
assumptions in the Wikipedia Bayesian analysis, but they are less determining
than the ones in the book.
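For concreteness, a minimal sketch of the kind of Bayesian calculation being
discussed (the uniform prior up to the cap n_max and the example serials are
assumptions; the likelihood of a particular set of k distinct serials from a
fleet of size N is proportional to 1/C(N, k)):

```python
from math import comb

def posterior(serials, n_max=1000):
    """Posterior over fleet size N, with a uniform prior on 1..n_max."""
    k, m = len(serials), max(serials)
    # N must be at least the largest serial seen; likelihood falls off
    # as 1 / C(N, k) beyond that.
    weights = {n: 1 / comb(n, k) for n in range(m, n_max + 1)}
    total = sum(weights.values())
    return {n: w / total for n, w in weights.items()}

post = posterior([19, 40, 42, 60])
mean = sum(n * p for n, p in post.items())
print(max(post, key=post.get))  # the posterior mode is the sample maximum, 60
print(round(mean, 1))           # the posterior mean sits well above it
```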

------
te
The frequentist and Bayesian analyses give different answers to the central
question. Which one is more correct?

~~~
_delirium
There isn't really 'a' correct frequentist or Bayesian answer to the problem;
it's more two different ways of thinking about the problem, which could well
get you the same numerical results (though they might not).

The frequentist way of thinking about it is to ask what you mean by "more
correct", i.e. what properties do you want an optimal estimator to have?
Another way of putting this is: if you were to set up a simulation where the
real answer is known and data is sampled, and then you judge estimators by how
close they get (according to some penalty function scoring closeness) when you
run this simulation 10,000 times, which estimator would score the best? The
estimator with the minimum variance of all unbiased estimators (the MVUE) will
do optimally under some definitions of optimal; the MLE is another one that is
optimal for other definitions. Note that they're both frequentist and give
different answers.

The Bayesian analysis of the situation is that it basically comes down to your
choice of prior: the observed information is not in itself sufficient to
produce a single "best" estimate, but rather you combine it with your prior
distribution to produce an estimate. The Bayesian could end up producing
exactly the same estimate as the estimators this article labels "frequentist".
The Bayesian argument would be that what these estimators are really doing
under the language of MVUE/MLE/etc. is implicitly choosing priors, whereas the
Bayesian would explicitly choose one. The Bayesian would also probably not
really like the simulation-experiment idea (which is a pretty directly
frequentist thought experiment).
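That simulation thought experiment is easy to run. A sketch comparing the MLE
(just the sample maximum) with the m + m/k - 1 estimator from the article
(the true N, k, and trial count are arbitrary choices):

```python
import random

random.seed(0)
N, k, trials = 300, 10, 5000
mle_bias = umvu_bias = 0.0
for _ in range(trials):
    m = max(random.sample(range(1, N + 1), k))  # serials seen this trial
    mle_bias += m - N                 # MLE: the sample maximum itself
    umvu_bias += m + m / k - 1 - N    # maximum inflated by the mean gap
print(mle_bias / trials)   # systematically negative: the MLE underestimates
print(umvu_bias / trials)  # close to zero: the inflated estimator is unbiased
```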

~~~
bazzargh
_The Bayesian argument would be that what these estimators are really doing
under the language of MVUE/MLE/etc. is implicitly choosing priors_

I'm curious: which choice of prior corresponds to the frequentist answer? It
looks like it comes down to the distribution p(n|k)? Seems like something that
must've been studied.

------
stillsut
This is why the whole secret agent "#3" thing in movies like Bourne Legacy,
James Bond etc. is so ridiculous.

That's a worse code name than just using the person's real name, as it gives
hints about the total membership of the secret organization.

~~~
Tohhou
The name's Agent 127356 617521 468927 932557 316478 323374 395995 326630
491034 217268 274611 237524 337188 146754 679652 340738 635877 617517 992248
343367 515750 470302 132876 177599 156605 28370 149544 889133 470520 994752
52998 306348 827980 134251 494718 157786 643512 976648 676871 335476 314504
786821 432468 815692 537830 465962 245564 22239 948088 588642 356978 27525
7635 565720 138592 82302 437935 431429 66539 283187 428296 276837 52407 584999
385187 461461 616784 947454 981732 580233 239585 601256 943780 385422 669503
579611 619964 902619 999399 317012 976906 634968 515478 979532 360526 554976
333481 560942 196337 397222 19738 518392 842556 570366 142058 557450 818663
997306 239940 429107 Shaken not stirred.

------
therealmarv
Interesting. Especially because there is no German translation or Wikipedia
entry for this article.

~~~
frik
Unsurprisingly... German Wikipedia has (sadly) some badly behaved admins who
delete pages because they are not "relevant". These admins, who delete pages
seemingly at _random_, are called "Löschnazi" by the community [1].

My observation is that one has to resort to the English version, because
either the page is missing or it is biased toward Germany (even though German
is also the native language of Austria, Switzerland, etc.).

German Wikipedia has of course also a lot of good efforts like the WikiData
and Geolocation sub-project, etc. Hopefully they can kick the badly behaved
admins soon... the reader to author ratio is already alarming.

[1]
[http://de.wikipedia.org/wiki/L%C3%B6schnazi](http://de.wikipedia.org/wiki/L%C3%B6schnazi)
of course the page got deleted as well so there is a backup:
[http://de.pluspedia.org/wiki/L%C3%B6schnazi](http://de.pluspedia.org/wiki/L%C3%B6schnazi)

------
RA_Fisher
I remember my theoretical stats teacher showing us this problem. It's used all
the time in ecology. His example used it to estimate the number of alligators
in Louisiana swamps. They tag the alligators, release, and then using the tags
they re-capture over subsequent years, they can get an estimate of how many
alligators exist in the wild!

~~~
taejo
The German tank problem applies when all the alligators come pre-tagged with
serial numbers :) Capture-recapture is a slightly different problem:
[https://en.wikipedia.org/wiki/Mark_and_recapture](https://en.wikipedia.org/wiki/Mark_and_recapture)

------
bane
So here's an idea. Conventional intelligence was off by quite a bit, spurring
the Allies to overproduce tanks (which was possible due to the absurd American
industrial capacity), which then allowed the Allies to cleanly overwhelm the
order-of-magnitude-fewer tanks they actually came into kinetic contact with.

~~~
dhoulb
In case anyone is interested in the numbers: the USA produced in the region of
50,000 Sherman tanks during the Second World War, 5-10x more than Germany.
Plus the British tanks. It didn't take all that long to reach Berlin!

~~~
nl
Actually, the US produced 102,000 tanks and self-propelled guns, while Germany
produced 67,000.

Russia produced 105,000, which was the real difference.

[http://en.m.wikipedia.org/wiki/Military_production_during_Wo...](http://en.m.wikipedia.org/wiki/Military_production_during_World_War_II)

------
concernedctzn
This sounds like an excerpt from the Cryptonomicon, which I happen to be
reading right now.

------
auctiontheory
I first read about this work a few years ago, but had I encountered it before
college, I think I might have majored in statistics. Such powerful results -
feel like magic.

------
Pinatubo
Tank #7, you are now known as tank #22347.

Tank #1 and tank #22347, report to the commander for the orders related to
your suicide mission ...

------
ariwilson
I encountered a slightly different problem trying to find the size of the
union of a bunch of sets. We ended up just storing the smallest k int64 hashes
of each item in each set, and computing 2^64 / ((largest hash - smallest hash)
/ (k - 1)) as an estimate of the size of the union.
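A sketch of that k-minimum-values style estimator (the hash function, k, and
the example sets are assumptions; relative error shrinks roughly as 1/sqrt(k)):

```python
import hashlib
import heapq

K = 512
HASH_SPACE = 2 ** 64

def h64(item: str) -> int:
    """Map an item to a pseudo-uniform 64-bit integer."""
    return int.from_bytes(hashlib.sha256(item.encode()).digest()[:8], "big")

def estimate_union_size(sets):
    # Keep the K smallest hashes across the union; their average spacing
    # approximates HASH_SPACE / |union|.
    kmin = heapq.nsmallest(K, {h64(x) for s in sets for x in s})
    spacing = (kmin[-1] - kmin[0]) / (K - 1)
    return HASH_SPACE / spacing

sets = [{f"item-{i}" for i in range(j, j + 6000)} for j in (0, 3000, 6000)]
print(round(estimate_union_size(sets)))  # true union size is 12000
```

In a real streaming setting you would keep only the K smallest hashes per set
and merge those sketches, rather than materializing the whole union as above.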

~~~
robryk
This is a very old (and nice) problem in streaming algorithms. The solution
currently used in most places is HyperLogLog[1], which basically uses the
distribution of log(minimum value of hash) for a set of hashes.

[1]
[http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf](http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf)

------
germanTankPanzy
I think the most important information is the table:

    Month          Statistical estimate    Intelligence estimate    German records
    June 1940      169                     1,000                    122
    June 1941      244                     1,550                    271
    August 1942    327                     1,550                    342

Intelligence estimates... so off the mark.

~~~
twic
But these estimates were derived directly from a graph in the Germans' pitch
deck!

------
elwell
The Germans should have sanded off the serial numbers.

