
Model Explainability is The Next Data Science Superpower - headalgorithm
https://towardsdatascience.com/why-model-explainability-is-the-next-data-science-superpower-b11b6102a5e0
======
valgaze
This is interesting:

A Kaggle competition to predict loan defaults gives an extreme example. This
competition had hundreds of raw features. For privacy reasons, the features
had names like f1, f2, f3 rather than common English names. This simulated a
scenario where you have little intuition about the raw data.

One competitor found that the difference between two of the features,
specifically f527 - f528, created a very powerful new feature. Models
including that difference as a feature were far better than models without
it. But how might you think of creating this variable when you start with
hundreds of variables?

The techniques you’ll learn in [OP's course] would make it transparent that
f527 and f528 are important features, and that their role is tightly
entangled. This will direct you to consider transformations of these two
variables, and likely find the “golden feature” of f527 - f528.
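
A minimal sketch of how that could surface, in base R (the data-generating
process and feature names below are invented stand-ins for the Kaggle set,
not the real competition data):

    
      # Simulated stand-in: default risk driven only by f527 - f528.
      set.seed(1)
      n    <- 1000
      f1   <- rnorm(n)                                # an irrelevant feature
      f527 <- rnorm(n)
      f528 <- rnorm(n)
      y    <- rbinom(n, 1, plogis(3 * (f527 - f528)))
      d    <- data.frame(y, f1, f527, f528)
    
      fit <- glm(y ~ f1 + f527 + f528, data = d, family = binomial)
    
      # Crude permutation importance: accuracy lost when one column is shuffled.
      perm_drop <- function(col) {
        d2 <- d
        d2[[col]] <- sample(d2[[col]])
        base <- mean((predict(fit, d,  type = "response") > 0.5) == d$y)
        perm <- mean((predict(fit, d2, type = "response") > 0.5) == d$y)
        base - perm
      }
      sapply(c("f1", "f527", "f528"), perm_drop)
    

Shuffling f527 or f528 wrecks the fit while f1 is inert, and summary(fit)
shows the pair with near opposite-signed coefficients: the cue that they act
through their difference.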

~~~
bigger_cheese
One thing I have often found is that data context matters for a lot of
modelling work. It's all well and good to throw 1000 inputs named F_1...F_1000
into a model and see what gets spat out, but I question how valid that is
going to be.

To give a recent example from my workplace: there is an online gas
chromatograph hooked up to a reactor. External modelling (of the "throw 1000+
variables at the wall and see what sticks" variety) showed that the percentage
of nitrogen, as measured by the chromatograph, was strongly correlated with
something we'd asked them to investigate.

The people doing the modelling, who had no experience with basic chemistry,
latched onto this and got really excited; they wrote a large report full of
recommendations about nitrogen.

When my manager saw this he burst out laughing. Nitrogen is an inert gas; it
plays no role in the reaction. What the data was actually showing is that
nitrogen content varies in response to the concentrations of the other gases
(H2, oxygen and CO/CO2) changing; if you recall, air is typically 78%
nitrogen. We don't control nitrogen at all: its value is determined by the
other gases in the reactor.
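
To illustrate that closure effect with made-up numbers (nothing below is real
plant data): the measured fractions sum to one, so the leftover N2 share
tracks every controlled gas without playing any causal role.

    
      # Gas fractions sum to 1, so the uncontrolled inert remainder (N2)
      # correlates with the controlled components. All values invented.
      set.seed(2)
      h2  <- runif(200, 0.1, 0.3)
      cox <- runif(200, 0.1, 0.3)   # CO/CO2 and the rest, lumped together
      n2  <- 1 - h2 - cox           # N2 is just whatever is left over
      cor(n2, h2)                   # strongly negative, no causation anywhere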

I think this demonstrates pretty well how easily people can see something in
the data and jump to conclusions that, while well meaning, lack a fundamental
understanding.

Model explainability is good, but input explainability is just as important.
Important questions to ask are often:

- What am I feeding into this model?
- What is the expected behavior of this input? (I don't know is a valid answer here)
- Did the input behave as I expected it to in the model?
- If not, why? (This is the interesting and potentially valuable part)

~~~
adrianratnapala
So if I am understanding this story, some people found a correlation and
mistook it for causation. An obvious no-no, albeit an easy trap to fall into.
That's assuming what you were looking for was a handle on how you could get
more or less of whatever it was you wanted.

But if you just wanted a diagnostic, then the data is telling you that the
nitrogen content is (in practice) such a diagnostic. Depending on what you
already knew, that might or might not be interesting or useful, but it is
hardly nonsense.

~~~
bigger_cheese
Yes, that's essentially correct. The N2 means something, just not what the
analysts thought it meant. Their recommendations were misleading, and they
could have saved some time if they'd discussed it with someone familiar with
the process.

------
dannykwells
I mentor individuals looking to transition into data science. I can say
without a doubt that those who do the best (get the best jobs) are those with
_domain expertise_ and who are able to leverage that expertise in a data
science realm.

Of course, domain expertise is hard to come by and takes years to develop. I
look forward to data science leaving far behind the notion that "you too can
become a data scientist with 3 weeks of Python + scikit-learn".

~~~
bilbo0s
> _" you too can become a data scientist with 3 weeks of Python + scikit-
> learn"..._

I don't think any serious practitioners were out there thinking in this
manner. But there is a gold rush right now, and any gold rush will attract its
share of charlatans.

~~~
XuMiao
In fact, many quantitative researchers in their domains are already data
scientists without using modern tools like machine learning. Being able to use
these tools is only the beginning of a new way of practicing and advancing.

I consider it the responsibility of computer scientists and engineers to make
the tools better and easier. But unfortunately, as things stand, one has to
become a good computer scientist first and a domain expert second.

We are not doing enough innovation in tooling.

~~~
Jedi72
Computing people in general seem to have forgotten that the whole reason our
field exists is to build tools for other people, the ones who do the "real
work". With the possible exception of video games, nothing we do with
computers is useful or interesting on its own; its only point is to assist
other fields. So we can have all the advances in the world, but if we can't
make them into useful tools, what was the point?

------
boxfire
Sensitivity analysis is the next data-science buzzword: a rewrite of
classically important skills in model building.

~~~
hprotagonist
Just you wait until someone redefines hysteresis, too!

------
nonbel
>"What features in the data did the model think are most important?"

This is going to be more misleading than informative, since the feature
importance hierarchy can change drastically based on what other features are
included.
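
A tiny simulated example of that instability (invented data, no particular
model implied): add a near-duplicate of a predictor and its apparent
importance collapses.

    
      # Two nearly collinear predictors: x1's apparent importance depends
      # entirely on whether x2 is also included. Data invented.
      set.seed(3)
      x1 <- rnorm(500)
      x2 <- x1 + rnorm(500, sd = 0.1)   # near-copy of x1
      y  <- x1 + rnorm(500)
    
      coef(summary(lm(y ~ x1)))["x1", ]               # large, decisive effect
      coef(summary(lm(y ~ x1 + x2)))[c("x1", "x2"), ] # both now look weak/unstable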

> "For any single prediction from a model, how did each feature in the data
> affect that particular prediction"

Ok, but I don't see how this will be comprehended as anything other than a
black box for the typical ML model.

> "What interactions between features have the biggest effects on a model’s
> predictions"

If there are still big "interaction" effects, that means you should have done
better feature generation/selection, as in the toy example below.
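
For instance (toy data): a target driven purely by an interaction is invisible
to an additive model until the product feature is engineered in.

    
      # A target that is purely an interaction. Data invented.
      set.seed(5)
      x1 <- rnorm(300)
      x2 <- rnorm(300)
      y  <- x1 * x2 + rnorm(300, sd = 0.1)
    
      summary(lm(y ~ x1 + x2))$r.squared               # ~0: additive model misses it
      summary(lm(y ~ x1 + x2 + I(x1 * x2)))$r.squared  # ~1 once the feature exists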

~~~
iguy
> since the feature importance hierarchy can change drastically based on what
> other features are included

This sentence describes a disturbingly large proportion of social science
papers: after controlling for X, Y, Z (naturally!), we find that A is the
leading cause of B.

~~~
nonbel
Yes, precisely. It is a common practice in medical research as well.

------
minimaxir
Oddly, this post doesn't bring up linear/logistic regression, which is still
used extremely often today despite the rise of ML/DL _because_ the models are
100% explainable, and can therefore drive business decisions. Sure, it may not
win you Kaggle competitions, but there are more important things in life.

Another "compromise" is using variable importance outputs from tree-based
models (e.g. xgboost), which IMO is a crutch; it says which variables are
_important_ , but not whether it's a positive or negative impact.
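
A small sketch of that blind spot, using rpart as a stand-in for xgboost
(simulated data): one feature pushes the target up, the other down, and the
importance output cannot tell them apart.

    
      # Unsigned importance: a positive and a negative driver look the same.
      library(rpart)
      set.seed(4)
      up   <- rnorm(500)
      down <- rnorm(500)
      y    <- up - down + rnorm(500, sd = 0.1)
    
      fit <- rpart(y ~ up + down, data = data.frame(y, up, down))
      fit$variable.importance   # similar magnitudes; no sign information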

~~~
nonbel
> "the models are 100% explainable"

In my experience this is largely illusory. People think they understand what
the model is saying but forget that everything is based on assuming the model
is a correct description of reality.

Your typical case of linear regression isn't even close to a correct
description of reality; what gets included is largely arbitrary and due to
convenience. As a result, different people with different types of data can
get very different estimates for any features common to both models.

Also, stuff like this:

"I don't even think simple linear models are actually explainable. They just
seem to be. Eg, try this in R:

    
    
      set.seed(12345)
      treatment = c(rep(1, 4), rep(0, 4))
      gender1   = rep(c(1, 0), 4)   # say male = 1, female = 0
      gender2   = rep(c(0, 1), 4)   # the same variable, coded the other way round
      result    = rnorm(8)
    
      # With an interaction term, the treatment coefficient is the treatment
      # effect when the gender dummy is 0, so recoding the dummy changes it.
      summary(lm(result ~ treatment*gender1))
      summary(lm(result ~ treatment*gender2))
    

Your average user will think the coefficient for treatment tells you something
like "the effect of the treatment on the result in this population when
controlling for gender". I get a treatment effect of 1.17 in the first case,
but -0.38 in the second, just by switching whether male = 0 and female = 1 or
vice versa."
[https://news.ycombinator.com/item?id=16719754](https://news.ycombinator.com/item?id=16719754)

~~~
minimaxir
That's fair. Explainable, yes, but making sure the regression follows
statistical assumptions is another can of worms, and one ignored by the rise
of "easier" ML/DL.

~~~
nonbel
The thing is, ML models don't try to be interpretable; instead, people try to
find models that make correct predictions.

AFAICT, that is also the appropriate role for simpler methods like linear
regression, unless you have reason to believe your model is robust to any
missing "unknown" features and any included "spurious" ones. In that case you
are probably dealing with a model based on domain knowledge and not even
running a standard linear regression.

People (over-)interpret the coefficients all the time, though; I actually
think this is a huge problem.

------
wwarner
I think it's more exciting than the author admits. Sure, it'll increase
confidence and adoption, but it will really mean that the machine can be
queried like an expert on whatever corpus it was trained on.

------
curiousgal
..and the convergence of Data Science to Statistics continues.

~~~
cjmaria
In a similar vein, it looks like there will be a resurgence of complex systems
science and explaining "emergent" properties of complex systems.

~~~
hhs
That would be interesting; I wonder if it will lead to new nonlinear theories.

------
hhs
I wonder if data scientists engage in a triangulation* of their sources and
models to explain things?

*I refer to this sense of triangulation: [https://en.wikipedia.org/wiki/Triangulation_(social_science)](https://en.wikipedia.org/wiki/Triangulation_\(social_science\))

~~~
wrnr
Look into the provenance research that has come out of the semantic web
community. They have developed a model where triple statements behave like
short English sentences:

    
    
      :subject :predicate :object .
    

ex:

    
    
      :alice :loves :bob .
    

This statement by itself tells you nothing: did Alice say she loves Bob, or
was it Bob who said it is Alice who loves him, or did Carol see the way Alice
looked at Bob and conclude that she must love him? And what exactly is the
quantitative difference between the love she feels for Bob and my love for
chocolate?

To support this kind of information, each term is annotated with metadata like
document of origin, author, time and place.

ex:

    
    
      :s1 rdf:type rdf:Statement ;
          rdf:subject :alice ;
          rdf:predicate :loves ;
          rdf:object :bob .
      
      :a1 rdf:type prov:Activity ;
          prov:wasAssociatedWith :alice ;
          prov:hadMember :s1 .
      
      :alice rdf:type prov:Agent .
    

In computer science this process is called reification, and it could be the
first step toward creating a model that takes the source into account.

~~~
hhs
This is very useful. I'll check out this vocabulary, wrnr, thank you!

------
crankylinuxuser
It's also about model justification and defense. And even unprotected
variables can be shown to synthesize protected variables; if I were in the
legal profession, that's how I'd attack it.

Again, data is toxic. Even if you think you're doing something safe, there's a
chance you're not.

------
drieddust
"Correlation is not causation" is the common wisdom.

Yet exploiting correlations is exactly what DL techniques seem to be doing,
and I am amazed that it works so well.

I am not an expert. Can someone ELI5?

~~~
Nasrudith
Well, the thing is that correlations can be used, but they need a correction
in the loop somewhere. The job of the human is to laugh and correct it. One
example was teaching computers to draw cats based on reference images: many of
the references were lolcats in varying languages, and the model associated the
floating fonts with cat features, resulting in a blurry, eerie gibberish mess.

[https://mobile.twitter.com/goodfellow_ian/status/93740653074...](https://mobile.twitter.com/goodfellow_ian/status/937406530743287808?lang=en)

The two fixes are to either exclude the captioned examples from the set or to
add a filter: "this is caption text - ignore it".

