
The Broken Chain Problem - headalgorithm
https://seekingquestions.blogspot.com/2017/03/four-parables-one-lesson-broken-chain.html
======
jiggawatts
You see this in every field, including IT of course.

My favourite example of this was the introduction of the Microsoft Distributed
File System (DFS) service into a customer environment. We were only using the
DFS Namespaces feature, which is like DNS for file shares: the client
downloads a blob of metadata and uses it to find an underlying file share for
a logical name. The caching is very effective and the additional delay is
negligible.

They had a file performance problem. So of course, they point the finger at
the last thing that has changed in the environment: DFS! It must be DFS! I
calmly point out that if the performance is bad _after_ a connection has been
made, it can't possibly be DFS, it's out of path after the initial
negotiation. It's like blaming DNS for slow downloads. Makes no sense! But
nope. They just won't accept that. It's the last thing that has changed!

I came back a week later to discover that instead of 2 redundant DFS servers,
there are now 6 redundant DFS servers "for performance". To their surprise
however, the file shares are still mysteriously behaving poorly.

They dragged me along to a meeting with a cast of thousands to discuss the
issues. One guy spent most of the time arguing for increasing the number of
DFS servers to 8, 10, or perhaps even more to solve the issue!

Meanwhile I'm having a side-conversation with the storage guy, who sheepishly
admits that 3 out of 4 fibre channel paths were down, and this started at
about the same time frame as the deployment of DFS. I point this fact out to
the room. Everyone looks at me while blinking slowly. A few more moments of
silence pass. Then someone helpfully suggests adding more CPU and RAM to the
DFS servers. Maybe 8 processors and 64 GB will do the trick!

~~~
selestify
I don't understand, why do things like this happen? Is someone emotionally
invested in upgrading the DFS servers?

~~~
jiggawatts
In this particular case it was "politics" in the smallest sense of the word.
The head of the storage team did not get along with the head of the infra
team, so the infra guys would solve the problems they could solve, ignoring
any storage subsystem problems because that was a "dead end".

This is exactly like the boy looking for the coin under the light. It feels
like it _could_ be productive, versus definitely not being productive. It
doesn't matter that logically it won't work, it's the feeling that matters.

------
yalooze
I immediately thought of 2 related points:

The Scream Test: If you see something and you don't know what it does, remove
it and see if anyone screams.

Chesterton's Fence [0]: "reforms should not be made until the reasoning behind
the existing state of affairs is understood" (Though the purpose of this fence
is not obvious, there may be valid reasons for its presence.)

[0]
https://en.wikipedia.org/wiki/Wikipedia:Chesterton%27s_fence

~~~
kuu
Uhm, it's interesting, I see a little contradiction in your two points...

~~~
coldtea
Well, it's about balance.

If you applied the "Scream" test for a component you don't know what it does
in an airplane, it might fall/explode/whatever.

If you do it in a non-critical codebase, it might just break the build, and
have someone call you for it.

In the first case, the Chesterton's fence caution is more advisable. In the
second, you could go with a "remove it and see what happens" approach.

~~~
jchanimal
There were a few times in the development of Apache CouchDB where whole
features were “accidentally” removed as a way of pointing out that the test
suite didn’t exercise them.

------
andi1304
"When you want to influence the world around you, make sure that your action
is causally connected to whatever you want to influence."

How can we know when an action is causally connected to an effect? Western
philosophy since David Hume has struggled to find a solution to this problem.
The 'problem of causality' is that it is extremely difficult to establish
cause and effect with any degree of certainty. All we have is correlation.
Modern thinkers (e.g. Judea Pearl) are still grappling with this problem
today.

Pearl's fascinating talk on this subject at PyData 2018 is available on
YouTube:
https://www.youtube.com/watch?v=ZaPV1OSEpHw

~~~
twic
It only seems to be difficult for philosophers.

~~~
earthboundkid
Like many philosophical problems, regular people solve the problem of
causality by plugging up their ears and saying "lalala, I'm not listening."
And then we get crap like analytics marketing teams.

------
amelius
Meh. If you take the average of everybody's nose length, you still have a
better estimate than just pulling a number from thin air.

Also, the inhabitants' action is basically what scientists do all the time
when they run experiments. Correlation is not causation, but quite often it is
a good lead.

~~~
yosamino
They didn't average the length of everyone's nose.

They averaged everyone's _estimate_ of the emperor's nose.

------
Gusmann
Sometimes it's hard to identify a broken chain on the spot and a lot of
marketing tricks are based on that. Highly recommend you to read Freakonomics
if you haven't yet.

Also, this article mentions two observations of Richard Feynman, and I just
can't stop admiring the curiosity and wit of the man. His biography totally
inspired me and made me laugh a lot.

------
hyperman1
3 of these stories are on HN regularly, but the Soviet one was new to me.

Now I wonder: what should the central planners give as metrics? Is there
anything that works?

~~~
hoseja
Obviously, number or weight of standard-compliant nails.

~~~
dagw
But then you need someone to test the nails for standard-compliance, and what
metrics should you give them to make sure they do a good job...

~~~
subroutine
You can revert to a count-based performance target, and test for compliance
based on parcel weight

(e.g. a 2-inch finishing nail weighs 0.79 g; if a factory reports 1 million
units, take a random 0.8 kg sample and check that it contains ~1k nails).
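The check above can be sketched in a few lines. This is a minimal
illustration using only the numbers given in the comment (0.79 g per nail,
a 0.8 kg sample); the ±5% tolerance is an assumed parameter, not something
specified above.

```python
# Sketch of the count-plus-weight compliance check described above.
# Assumptions: a standard nail weighs 0.79 g (from the comment), and a
# sample is accepted if its nail count is within an arbitrary ±5% of the
# count its weight implies.

NAIL_WEIGHT_G = 0.79      # claimed weight of one standard 2-inch nail
SAMPLE_WEIGHT_G = 800.0   # random parcel pulled from the factory's output

def expected_count(sample_weight_g: float = SAMPLE_WEIGHT_G,
                   nail_weight_g: float = NAIL_WEIGHT_G) -> float:
    """How many compliant nails a sample of this weight should contain."""
    return sample_weight_g / nail_weight_g

def looks_compliant(counted_nails: int, tolerance: float = 0.05) -> bool:
    """True if the actual count is within ±tolerance of the expected count."""
    expected = expected_count()
    return abs(counted_nails - expected) / expected <= tolerance

# 800 g / 0.79 g ≈ 1012.7 nails -- the "~1k" figure in the comment.
```

A factory padding its count with undersized nails would overshoot the
expected count for the sample weight, and one casting oversized nails to
save per-unit labour would undershoot it.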

~~~
dagw
Then the factory will just prioritize making sure all their nails weigh the
same over any other consideration.

~~~
subroutine
They should be prioritizing count (I mentioned that was the performance
metric). They actually wouldn't need to "make sure all their nails weigh the
same", only that on average they weigh as much as the template. If you can't
save time by using less material, and you can't save time by generating fewer
units, might as well make the foundry mold match the template, no?

------
nwsm
I look forward to analyzing the sequence of physical forces that connect me
wanting to tell someone something to keys being pressed on a keyboard to an
email showing up in their inbox as part of my decision process at work.

We don't have time to analyze entire chains so we make assumptions about
chains that should be unbroken. Occasionally but inevitably we make wrong
assumptions.

------
jancsika
> but none of the townspeople know anything at all about the emperor’s nose,
> so the causal chain from the actual emperor’s nose to the elders’ estimate
> is broken.

They do. They are humans, and humans have noses.

Using the statistical method decreases the chance that a wild guess by the
sculptor ends up angering the Emperor and causing deaths in the village.

~~~
sudhirj
I don’t think the sculptor was planning to sculpt an outlandish alien nose.
The problem is that if your statue doesn’t look right, heads will roll.

The story isn’t about the selection of outlier data points as being
representative - it’s more about drawing conclusions through seemingly
plausible-sounding methods that are nevertheless completely disjoint from the
thing you’re reasoning about.

