Hacker News new | past | comments | ask | show | jobs | submit login
German tank problem (wikipedia.org)
231 points by fortepianissimo on Feb 21, 2014 | hide | past | favorite | 83 comments

This actually hit my previous company in a software context.

We would number our hotfixes sequentially. Many would be items demanded by a single client, so would get deployed as hotfixes only to that customer's site, and just rolled into the main trunk for the next quarterly release for everyone else. Clients would always be notified about hotfixes going onto their live sites.

One savvy client noticed the hotfix numbering sequence. Naturally, that ensued quite a number of extremely awkward discussions as they would regularly ask why our software needed so many hotfixes (tens per week) and why they weren't entitled to all of them right away.

Solution: a new policy to randomly generate hotfix numbers. Which of course led to the next problem, that now the sequence was not obvious from the names, so dependent hotfixes would sometimes get deployed in the wrong order. Why can't anything be easy...

Just name the hotfixes by day (140222). It monotically increases, and if you do multiple hotfixes in one day, suffix a/b/c etc. Generally you're unlikely to get up to b or c, and there's no clue to how many previous versions there's been.

Drop the a,b,c... just timestamp it.

Leads to longer timestamps though, when you include the HHMM. If you're generally not doing more than one per day, this makes it a little easier - usually six digits instead of ten.

If only we had access to some kind of machine that made dealing with long numbers easy.

Except that a hotfix number is a human-interaction number, not an automation device. Speaking as someone who has worked on the support phones, I'd much, much, much rather someone only have to read back a six-digit number than a ten-digit one.

That was proposed (with a numeric suffix rather than letter), but rejected because we often would do 3 to 5 in a day, and somebody decided that would still leak too much information about our hotfix workflow. Another aspect would be that hotfixes would sometimes go several days between packaging and deployment (for all your usual megacorp red-tapey risk-aversey reasons), and we didn't want to explain to clients why 20120214 wasn't deployed until 2012-02-21.

my goodness.

i'm sure you (or your employer for that matter) have better things to do than twisting your internal policies to client's dis/liking. just tell them to fuck off -- in diplomatic speak -- that's what your/a boss is (paid) for.

> why our software needed so many hotfixes

"well. software does not fall from the skies. humans are involved. we can stop fixing shit before you find out and complain, take your money and leave you with crackers. decide!"

I've worked in enterprise software and when customers are paying tens of millions of dollars per year for your software, it's difficult to tell them to fuck off... even more so when your customers all know each other.

Nobody said its easy, we are saying that it must be done or you and the customer will both be unhappy.

I think that far less software businesses fail due to telling their customers to fuck off than vice versa.

At least that is what my anecdotal evidence says. I know of a couple of software businesses that failed because they couldn't get themselves to say no to their customer.

I know of none that told their customer to fuck off and failed.

> i'm sure you (or your employer for that matter) have better things to do than twisting your internal policies to client's dis/liking

I'm sort of at a loss here. The customer is always right, right? If you're deploying software, everything the customer sees is what they're paying for. If it matters to them, and it bothers them (which I can see in this case), I think the good companies fix it, and the bad companies have "better things to do"

Customers always act entitled.

"The customer is always right" is an incredibly broken methodology. It works great for a mom&pop shop in a small town… not so much in any other situation.

we're talking about throwing ticket numbers around. and this fixes absolutely NOTHING except egos.

> The customer is always right, right?

no. if they talk bullshit, it's still bullshit. then to just comply and go with bullshit is the reason there's so much atrocity.

don't be sheep and talk to the other side like educated human beings do.

You could always have a fixed length randomly generated number prefix, with the numbers after that being sequential, making it obvious internally the sequence while obfuscating to the customer.

For example if you did a four digit prefix your first three hotfixes might be:

97321 33562 77493 ...

Then again… Why didn't you ship all your hotfixes to everyone?

What you need is random but monotonically increasing sequential version numbers. Add a random number to your last version number, and that is your new version number. ;)

That's still vulnerable to statistics though. I would just use timestamps.

Very good idea. Timestamps do not reveal how many hotfixes are made and are chronological by default. To be sure, there could of course be a central instance providing the stamps and ensuring that no two fixes get the same ... (when several fixes could be made in parallel)

There is some practical relevance to software development here. One shouldn't expose sequential IDs (a.k.a. serial numbers) to the public for anything non-public.

I see this Hacker News post has a numerical ID in the URL, for example; I can estimate the size of Hacker News given enough of these numbers... More directly, I can modify that numerical ID to crawl Hacker News.

Many sites do this; it's generally better to generate a (random or hashed or generated from a natural key) 'slug' to use as the key instead. For example, Amazon generates a unique, non-sequential, 10-digit alphanumeric string for each item in their catalog.

I'd advise against modifying that numerical ID. Recent history shows this is a punishable offense (hacking!) under the lovely CFAA and supported by the incompetent judicial system.

For example Facebook exposed the internal user-id, Mark Zuckerberg: http://www.facebook.com/profile.php?id=4

Other sites like Pastebin use a sequential ID too, but the convert number to another base (like from base 10 to base 43). It's shorter too. (I haven't found the link to the related stackoverflow discussion)

I joined in 2005, when you still needed to be part of a school which had a Facebook network. At that time the ID for most users was made up of a network number (which was in the 4-digit range by the time my school was added) and then a 5-digit user code. So you could (and still can, given the right info) figure out the first person to join from your school, as well as when your school got a network relative to some others nearby.

You can still use this to find some mundane information about Facebook's early userbase. For example, the first non-Harvard schools added were, in order, Stanford, Yale, Cornell, Dartmouth, U.Penn, Columbia, etc., and that the first people added to most of those networks either grew up with or attended the same high school as Zuck.

The flipside is that you can give off the impression of having a large user base/product catalog/etc if you number things sequentially... but start at a large non-round number.

It seems like one could use the same technique to estimate the initial (lowest-observable) serial number...

From the article:

  If starting with an initial gap between 0 and the lowest
  sample (sample minimum), the average gap between samples is
  (m - k)/k; the -k being because the samples themselves are
  not counted in computing the gap between samples.
Perhaps someone with a better grasp on the math can confirm that this makes 'obfuscating size by starting with a higher serial number' an ineffective mechanism?

Yes. If you're only looking at the gaps between the numbers, adding a constant offset to the serial numbers would have no effect on the estimate.

On the other hand, if instead of ordering them sequentually I roll a die and add the number of spots to the previous serial number, I think I can trick you into thinking I have three times as many tanks as I actually do. In fact, I feel quite confident of it.

If I find sufficiently many of your tanks, the distribution of the differences in serial numbers would start showing that we aren't talking about a random sample from 1…n

For example, having seen 250 IDs in the 1…1000 range and 200 in the 1001…2000 range, the next ID in the 1…2000 range I see should fall in the 1…1000 range with probability 750/(750 + 800) ~= 0.48 in the 'normal' case, and around 36/(36 + 86) ~= 0.30 with your method of doling out IDs.

And I think it would be a factor of 3.5 (the expected number of eyes on a throw of a die). That's why I expect your method to dole out 286 out of every 1000 IDs.

But it would require me from checking for this, depend on finding such discrepancies in samples, and increase the variance on my estimates for a given sample size.

I don't have the math (or time (or ambition!)) to confirm or deny this, but I think that it's not really meant to stand up to serious scrutiny - just enough for someone to glance at an invoice or a URL or a receipt and being more impressed/less wary. (See fecak's reply to my comment.)

Some businesses do this with invoice numbers to give the false impression of being more successful (more invoices should equal more sales and customers) than it actually is.

I did this myself with personal checks. I asked my bank to give me personal checks starting at some moderately sized number so that I wouldn't be giving anyone check #0000001.

The military do this as well. The British SAS when first formed was designation "L Detachment Special Air Service" as part of a disinformation campaign to persuade the Germans there were paratroopers, and at least 11 other detachments of the same size, in North Africa.

So there are ways to game the statistical analysis.

Nonsequential IDs can also help you avoid inadvertently baking in assumptions about your IDs being small, which are then broken as you scale up and cause sorrow and woe.

For example, Twitter uses sequential(ish) ID numbers for tweets. Havoc ensued when they crossed 2^32 tweets, because lots of software out there ended up being written with the assumption that a tweet ID could fit into a 32-bit integer. It was true for some years, and then suddenly it wasn't anymore, and everything broke.

Yes there are a huge amount of sequences out there with no special need of being sequential. Apple store invoices. UPS tracking numbers (sharded by shipper). Lots of similar things that can provide a glimpse into the inner workings of what are often publically traded companies with good reason to not leak such information.

For books, Amazon's ASIN is just the ISBN-10.

PHP's uniqid() uses time, so that's usually sufficiently indiscernible.

It's astounding how accurate they were using only statistical methods:

> Analysis of wheels from two tanks (48 wheels each, 96 wheels total) yielded an estimate of 270 produced in February 1944, substantially more than had previously been suspected.

> German records after the war showed production for the month of February 1944 was 276.

It wasn't exclusively statistical:

    The analysis of tank wheels yielded an estimate for the
    number of wheel molds that were in use. A discussion with
    British road wheel makers then estimated the number of
    wheels that could be produced from this many molds...

You can't do statistics without data. You aren't complaining that they used wheel data, either, did you?

Huh, I once visited a military base where people on the trip wanted to be photographed with a tank. The soldiers said it was OK, as long as somebody obscured the tank's serial number by standing in front of it. I wonder if their training in this respect was inspired by this history!

(But if so, why not print the serial numbers inside the tank, not outside? Or maybe encrypt or HMAC them?)

I had a similar experience and asked.

The answer was that when things like tanks get moved, it's by rail or ship. When figuring out what goes where, having to climb up on a railcar to figure out whose tank it was gets old.

The wiki article mentions estimations being done from parts from broken tanks, in which case internal numbers wouldn't necessarily help.

The Soviets used to semi-randomly change ship pennant numbers and other identifying marks on their equipment to throw off NATO efforts to develop accurate strength estimates, orders of battle, and movement pictures.

As a less intentional version of this - at early stages of the 1973 war, Israeli forces in certain areas tended to consist of whatever scraps of forces arrived at the front first. This led Egyptian and Syrian intelligence to wildly overestimate Israeli force sizes, since they would sometimes see the identifying marks of multiple brigades when facing a very small force.

> why not print the serial numbers inside the tank

I think by serial number you mean the unit number and markings stenciled on the tank?

Something like 'C Co/5th Bn/377th Rgt'

Because friendly soldiers need to eyeball that information at a glance. Climbing up and peering inside each and every vehicle is a time-waster.

Finally, an application for Google Glass: Tank identification. Paint on a camouflage QR code and we're ready to go.

FWIW, I believe the military already deploys mobile computers with HUDs like this.

Such a (GPS-based) system I've worked with in the past is called Blue Force Tracking.


I don't know whether they use HUDs, and whether they work with tanks, but the common name for that is a "friend or foe" system (http://en.wikipedia.org/wiki/Identification_friend_or_foe)

I'd guess that they didn't know if they were allowed to do that, and figured if there was no serial number then they couldn't get in trouble.

So, I HMAC the serial numbers of the tank. Now you have a series of very large numbers, evenly distributed.

Someone else can do the proof, but I'm pretty sure the total number of tanks is inversely proportional to the smallest difference between two observed HMACs (or between the highest observed HMAC and the highest possible HMAC).

EDIT: Actually, scratch that. That only works if you see all the tanks...

If the serial numbers are randomly distributed in some finite span, and you have the serial number of three tanks, say:

    a: 000001
    b: 000101
    c: 100000
What is the probability that c is not the highest serial number and that there are 1000 total tanks.

How many tank serial numbers would we need before those probabilities get > 90%, > 95%?

Apparently encryption is weak because the plaintext has a very simmple pattern (increasing numbers).

http://en.wikipedia.org/wiki/German_tank_problem#Countermeas... http://en.wikipedia.org/wiki/Known-plaintext_attack

I can't understand HMAC enough to know whether it solves it, but there seems to be a trade-off between keeping it secure and introducing randomness and making people do lookups on the other end (which would have been super slow then, and I don't know how feasable hash tables are when dealing with punchcards)

Why is encryption weak due to the simple pattern of increasing numbers? Encrypting increasing numbers is how CTR and GCM mode work; those are arguably some of the best modes of operation we have commonly available.

A MAC of a message m can only be computed with the knowledge of a key K. Specifically, with a cryptographic hash function h,

  HMAC(K, m) = h(K + a || h(K + b || m)),
where + is addition mod 2 (xor), || is concatenation and a and b are constants. (This construction takes into account possible length extension attacks on h.)

Given that h is secure, knowledge of any reasonable number of pairs (m, HMAC(K, m)) does not allow you to recover K, and without K, you cannot compute HMAC(K, m) for known m, i.e. enumerate all the possible MACs for serial numbers.

Don't remember where I read it at least 12 years ago, but someone talked about an April Fools prank where they released three pigs in their high school, with numbers 1, 2, and 4 written on them. Allegedly the administrators spent weeks looking for number 3.

The "weeks looking for number 3" part sounds apocryphal to me.

Snopes [0] has one real example and one television example of this prank. In the real example, the students were caught on camera, and there was no long search for the remaining livestock.

[0] http://www.snopes.com/college/pranks/livestock.asp

Thanks for actually researching this. As I remember, I read this bash.org, so I was pretty skeptical of the story just because of the source. It's one of those stories I'd like to believe was true just because I like it.

That's the version from 20 years ago. Those people have grown up and become school administrators, so it doesn't work as well anymore. The new plan is to release pigs numbered 1,2, 5,6,7, 10 and -2.

My favorite explanation of this (posed instead as the Locomotive problem) is in Allen Downey's "Think Bayes," pp.22

It's online too, and worth reading!


Book seems nice, but its discussion on the German Tank problem sets up code to calculate posteriors from priors that are more detailed than the Bayesian argument in the Wikipedia article. That is useful, but you don't get the problem intuition that the answer is the maximum ID you saw inflated by a factor depending on the expect ID-gap. There are some assumptions in the Wikipedia Bayesian analysis, but they are less determining than the ones in the book.

The frequentist and Bayesian analyses give different answers to the central question. Which one is more correct?

There isn't really 'a' correct frequentist or Bayesian answer to the problem; it's more two different ways of thinking about the problem, which could well get you the same numerical results (though they might not).

The frequentist way of thinking about it is to ask what you mean by "more correct", i.e. what properties do you want an optimal estimator to have? Another way of putting this is: if you were to set up a simulation where the real answer is known and data is sampled, and then you judge estimators by how close they get (according to some penalty function scoring closeness) when you run this simulation 10,000 times, which estimator would score the best? The estimator with the minimum variance of all unbiased estimators (the MVUE) will do optimally under some definitions of optimal; the MLE is another one that is optimal for other definitions. Note that they're both frequentist and give different answers.

The Bayesian analysis of the situation is that it basically comes down to your choice of prior: the observed information is not in itself sufficient to produce a single "best" estimate, but rather you combine it with your prior distribution to produce an estimate. The Bayesian could end up producing exactly the same estimate as the estimators this article labels "frequentist". The Bayesian argument would be that what these estimators are really doing under the language of MVUE/MLE/etc. is implicitly choosing priors, whereas the Bayesian would explicitly choose one. The Bayesian would also probably not really like the simulation-experiment idea (which is a pretty directly frequentist thought experiment).

The Bayesian argument would be that what these estimators are really doing under the language of MVUE/MLE/etc. is implicitly choosing priors

I'm curious, which choice of priors corresponds to the frequentist answer? It looks like it comes down to the distribution (n|k)? Seems like something that must've been studied.

I would guess that it mostly depends on the quality of the prior.

This is why the whole secret agent "#3" thing in movies like Bourne Legacy, James Bond etc are so ridiculous.

That's a worse code name then just using the person real name as it gives hints of the total participation in the secret organization.

The name's Agent 127356 617521 468927 932557 316478 323374 395995 326630 491034 217268 274611 237524 337188 146754 679652 340738 635877 617517 992248 343367 515750 470302 132876 177599 156605 28370 149544 889133 470520 994752 52998 306348 827980 134251 494718 157786 643512 976648 676871 335476 314504 786821 432468 815692 537830 465962 245564 22239 948088 588642 356978 27525 7635 565720 138592 82302 437935 431429 66539 283187 428296 276837 52407 584999 385187 461461 616784 947454 981732 580233 239585 601256 943780 385422 669503 579611 619964 902619 999399 317012 976906 634968 515478 979532 360526 554976 333481 560942 196337 397222 19738 518392 842556 570366 142058 557450 818663 997306 239940 429107 Shaken not stirred.

Interesting. Especially because there is no German translation or Wikipedia entry for this article.

Unsurprisingly, ...German Wikipedia has (sadly) some badly behaved admins that delete pages because the are not "relevant". These admins that delete pages like at random are called "Löschnazi" by the community [1].

My observation is that one has to resort to the English version, because either the page is missing or it is biased to Germany (albeit German language is also the native language of Austria, Switzerland, etc.).

German Wikipedia has of course also a lot of good efforts like the WikiData and Geolocation sub-project, etc. Hopefully they can kick the badly behaved admins soon... the reader to author ratio is already alarming.

[1] http://de.wikipedia.org/wiki/L%C3%B6schnazi of course the page got deleted as well so there is a backup: http://de.pluspedia.org/wiki/L%C3%B6schnazi

The germans didn't have to estimate the numbers, they just looked into their books!

I don't think it has any significance in that context to be honest. I've got a German background & have seen this posted a couple of times. Every time I think...of all the events/things in WW2 this is what you pick as interesting?

I remember my theoretical stats teacher showing us this problem. It's used all the time in ecology. His example used it to estimate the number of alligators in Louisiana swamps. They tag the alligators, release, and then using the tags they re-capture over subsequent years, they can get an estimate of how many alligators exist in the wild!

The German tank problem applies when all the alligators come pre-tagged with serial numbers :) Capture-recapture is a slightly different problem: https://en.wikipedia.org/wiki/Mark_and_recapture

So here's an idea. Conventional intelligence was off by quite a bit, spurring the allies to overproduce tanks (which was possible due to the absurd American industrial capacity), which then allowed the allies to cleanly overwhelm the order of magnitude fewer tanks they actually came in kinetic contact with.

In case anyone is interested in the numbers, USA produced in the region of 50,000 Sherman tanks during the Second World War. 5-10x more than Germany. Plus the British tanks. It didn't take all that long to reach Berlin!

Actually, the US produced 102,000 tanks and self propelled guns, while Germany produced 67,000.

Russia produced 105,000 which was the real difference.


This sounds like an excerpt from the Cryptonomicon, which I happen to be reading right now.

I first read about this work a few years ago, but had I encountered it before college, I think I might have majored in statistics. Such powerful results - feel like magic.

Tank #7, you are now known as tank #22347.

Tank #1 and tank #22347, report to the commander for the orders related to your suicide mission ...

I encountered a slightly different problem trying to find the size of the union of a bunch of sets. We ended up just storing the smallest k int64 hashes of each item in each set, and computing 2^64 / ((largest hash - smallest hash) / (k - 1) as an estimate of the size of the union.

This is a very old (and nice) problem in streaming algorithms. The solution currently used in most places is HyperLogLog[1], which basically uses the distribution of log(minimum value of hash) for a set of hashes.

[1] http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf

I think the most important information is the table: Month Statistical estimate Intelligence estimate German records June 1940 169 1,000 122 June 1941 244 1,550 271 August 1942 327 1,550 342

Intelligence estimates... so off the mark.

But these estimates were derived directly from a graph in the Germans' pitch deck!

The Germans should have sanded off the serial numbers.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact