A Kaggle competition to predict loan defaults gives an extreme example. This competition had 100s of raw features. For privacy reasons, the features had names like f1, f2, f3 rather than common English names. This simulated a scenario where you have little intuition about the raw data.
One competitor found that the difference between two of the features, specifically f527 - f528, created a very powerful new feature. Models including that difference as a feature were far better than models without it. But how might you think of creating this variable when you start with hundreds of variables?
The techniques you'll learn in [OP's course] would make it transparent that f527 and f528 are important features, and that their roles are tightly entangled. This will direct you to consider transformations of these two variables, and likely find the "golden feature" of f527 - f528.
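To make that concrete, here is a hedged sketch with synthetic data standing in for the anonymized competition features; permutation importance and a two-way partial dependence plot are used as stand-ins for whatever techniques the course actually teaches:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import PartialDependenceDisplay, permutation_importance

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 5)), columns=["f1", "f2", "f3", "f527", "f528"])
X["f528"] = X["f527"] + rng.normal(scale=0.1, size=500)   # a nearly collinear pair
y = (X["f527"] - X["f528"] > 0).astype(int)               # the "golden" signal

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Shuffling either member of the pair hurts the score, flagging both as important.
imp = permutation_importance(model, X, y, n_repeats=5, random_state=0)
print(pd.Series(imp.importances_mean, index=X.columns).sort_values())

# A two-way partial dependence plot over (f527, f528) shows predictions varying
# along the diagonal, i.e. with their difference - the cue to try f527 - f528.
PartialDependenceDisplay.from_estimator(model, X, features=[("f527", "f528")], grid_resolution=20)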
To give a recent example from my workplace: there is an online gas chromatograph hooked up to a reactor. External modelling (of the "throw 1000+ variables at the wall and see what sticks" variety) showed that the percentage of nitrogen, as measured by the chromatograph, was strongly correlated with something we'd asked them to investigate.
The people doing the modelling, who had no experience with basic chemistry, latched onto this and got really excited; they wrote a large report full of recommendations about nitrogen.
When my manager saw this he burst out laughing. Nitrogen is an inert gas; it plays no role in the reaction. What the data was actually showing is that the nitrogen content varies in response to the concentrations of the other gases (H2, oxygen and CO/CO2) changing - if you recall, air is typically 78% nitrogen. We don't control nitrogen at all; its value is determined by the other gases in the reactor.
I think this demonstrates pretty well how easily people can see something in the data and jump to conclusions that, while well meaning, lack a fundamental understanding.
Model explainability is good, but input explainability is just as important. Important questions to ask are often the following (a rough way to check them in code is sketched after the list):
-What am I feeding into this model?
-What is the expected behavior of this input? (I don't know is a valid answer here)
-Did the input behave as I expected it to in the model?
-If not, why? (This is the interesting and potentially valuable part.)
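One rough way to probe the third question, sketched with made-up column names (h2_pct, co_pct, n2_pct) loosely mirroring a reactor feed rather than any real plant data, is to look at a partial dependence plot and ask whether the shape matches what you expected from that input:

import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import PartialDependenceDisplay

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "h2_pct": rng.uniform(10, 40, 500),
    "co_pct": rng.uniform(5, 30, 500),
})
X["n2_pct"] = 100 - X["h2_pct"] - X["co_pct"]   # N2 is just whatever is left over
y = 0.3 * X["h2_pct"] + 0.1 * X["co_pct"] + rng.normal(scale=0.5, size=500)

model = GradientBoostingRegressor(random_state=0).fit(X, y)

# If the model leans on n2_pct, this shows a strong trend even though nitrogen is
# inert - exactly the cue to ask "why?" before anyone writes a report about it.
PartialDependenceDisplay.from_estimator(model, X, features=["n2_pct", "h2_pct"])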
But if you just wanted a diagnostic, then the data is telling you that the nitrogen content is (in practice) such a diagnostic. Depending on what you already knew, that might or might not be interesting or useful, but it is hardly nonsense.
It sounds like the modelers did exactly what they were supposed to. Just because N2 isn't directly controlled doesn't mean [N2] isn't a valid factor - even if its only mechanistic action is as a diluent. It's entirely possible an aspect of a reaction is sensitive to absolute reactant partial pressures as opposed to just relative concentration.
Just as often as modellers going nuts, abusing either the chemistry or signal-to-noise, I've also seen chemists argue that multivariate-modelling of some rather simple system is impossible or 'not science', relegating research to inefficient linear trial and error.
 For instance, if that is a Fischer-Tropsch reactor you are describing, then yes [N2] would likely go up with higher yield as your reactive gas species coalesce into heavier molecules.
The issue was their understanding of what the signal meant; they wrote a rather lengthy report full of nonsensical recommendations on the basis of not properly understanding what the input variable represented.
edit: At least at my org we use multivariate modelling a lot, though this is mostly primary metallurgy, not pure chem eng.
Here is a talk about times when that happens exactly:
We may have gone to the moon, but by and large scientists of today are just as susceptible to mistakes as prior generations that had "solid evidence" like how European ancestry correlated with higher IQ. As good as it feels to wrap ourselves in the warm fuzzy blankets of "being woke" and thinking we're somehow different, I'm going to keep remaining skeptical of the results until I know exactly how they were reached (and the black box of deep learning makes that a little tricky).
"Early on I said I wouldn’t be giving any ethical prescriptions.I will, however, give one meta-ethical prescription: formalize your ethical principles as terms in your utility function or as constraints.It is nearly certain that tradeoffs between these principles exist, and if we don’t acknowledge this, we run the risk of unknowingly engaging in bad actions."
For explanations, the key is having the input features be as intuitive as possible. So if a subtraction of two features is more intuitive than the two features separately, then the explanations will be more intuitive as a whole.
I read through the thread where f527 - f528 was discovered, and mostly the finding was that those two features alone were very predictive. Further research found that the two features are highly correlated, but negative of each other.
See the write-up of the winner here:
>Yasser Tabandeh posted a pair of features, f527 and f528, which can be used to achieve a very high classification accuracy (). Furthermore f527 and f528 are very highly correlated. However, it is not needed to keep both features. Their difference f527 − f528 contains the same amount of information
He also explains his method of finding other pairs of "golden" features in a 2-step iterative process.
Original thread: https://www.kaggle.com/c/loan-default-prediction/discussion/...
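I don't know his exact two-step recipe, but a rough sketch of the general idea (made-up data; a correlation scan for candidate pairs, then a cross-validated score for each difference on its own) looks something like this:

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(1000, 6)), columns=["f1", "f2", "f3", "f4", "f527", "f528"])
X["f528"] = X["f527"] + rng.normal(scale=0.05, size=1000)   # a highly correlated pair
y = (X["f527"] - X["f528"] > 0).astype(int)                 # target driven by the difference

# Step 1: flag highly correlated pairs. Step 2: score each pair's difference alone.
corr = X.corr().abs()
pairs = [(a, b) for i, a in enumerate(X.columns) for b in X.columns[i + 1:] if corr.loc[a, b] > 0.95]

for a, b in pairs:
    diff = (X[a] - X[b]).to_frame("diff")
    score = cross_val_score(LogisticRegression(max_iter=1000), diff, y, cv=3).mean()
    print(a, b, round(score, 3))   # the f527/f528 difference stands out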
I'm not sure I understand. If it is discovered, or "found" (by the media, I guess), that "these systems" are biased, I suppose that should result in them being fixed, even at the expense of them "tak[ing] off." In your view, is that a good thing or a bad thing? Is it better or worse that "the media" find it?

Also, by "normative historically," did you mean "representative?" In other words, do you mean a mathematical norm, as in an average? Or an ethical norm, as in a rule? (Hetero-normative, normative ethics, and so on.) Surely we're not talking about "correct biases?" It's just that my impression is that the language of "normative" in this context, a context which reads as sociological, is most often associated with societal, cultural, or ethical "norms," which are things that are viewed as correct in an "ought" sense.
> Just look at the contentions of prison reform automated systems and the like.
Can you elaborate? Because I think very few of us have looked at these examples. Which "prison reform automated systems and the like" have failed to "take off" due to a discovery of bias by "the media?"
Sorry for all the scare quotes. I'm just trying to piece it together and am having no luck.
And it explains that these types of things "being fixed" is not really possible in a way that doesn't screw some group over ... you just kind of have to decide who you want to screw over.
Have a read, it is very interesting.
The other reply answered the rest of the question as I see it, so I will respond to this part. In your language, I was referring to the representative demographics of some social class or occupation or the like, though demographics fitting to historical ethical norms would apply here as well. And all of this is with respect to Western Democracies such as America, of course.
I should also mention here that when we weight these AI scales with the goal of maintaining arithmetic equality amongst demographics, everyone loses. In particular, justice dies for the many when this becomes the case, and it is indefensible to argue otherwise.
It's akin to someone giving you a bunch of anonymous variables to predict stock prices. You would of course throw out irrelevant variables like the fog level that day or moon cycles. But you would never know which ones those were if the variables were anonymous; thus you would blindly feed garbage into your models.
Are you concerned about accuracy or privacy? You really can’t have it both ways, imo.
Of course, domain expertise is hard to come by and takes years to develop. I look forward to data science leaving far behind the notion that "you too can become a data scientist with 3 weeks of Python + scikit-learn".
I don't think any serious practitioners were out there thinking in this manner. But there is a gold rush right now, and any gold rush will attract its share of charlatans.
It's a dynamic that sadly burns out overly optimistic but smart data scientists, and leaves existing practitioners with a negative impression of the promise of ML. Those practitioners stick their heads back in a hole instead of innovating.
All to say, with more explainable ML (and more humility on everyone's part), more progress would be made.
I consider it the responsibility of computer scientists and engineers to make the tools better and easier. But unfortunately, as things stand, one has to become a good computer scientist first and then a domain expert.
We are not doing enough on innovations of tooling.
But a lot of thought pieces / YouTubers are pushing it, which is a problem.
I always took data scientist to be a rebranding of statistician. (To be clear there's nothing wrong with rebranding.)
I can buy that there’s a shortage of statisticians and so industry needs other people to pitch in, but it seems like the floodwater of incoming students should be directed to stats programs and not data science ones.
ML is hip and profitable.
Honestly, it's mostly garbage. The original notion of data scientists was invented by FB for a very specific set of skills (social science PhDs with Map-Reduce and experimental design), but it's a cool title and thus it got spread across multiple roles.
It's super weird though, despite doing data sciencey work for about a decade now, when I changed my title on LinkedIn to be data scientist, I started getting offers for jobs that paid a lot more money, so there's an incentive on the candidate side to re-brand.
But yeah, ultimately all the job is, is some stats, some code, and some communication. Don't get me wrong, it's a great job and it's hard to find people who are good at all of this, but in my experience the limiting factor is definitely the statistics and the domain knowledge rather than the code.
Do you have an example?
But more classically, any structured statistical model with (e.g.) terms for variable interactions, measurement noise, and hierarchy.
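For a concrete (if toy) example of what I mean, here is a sketch with made-up data using statsmodels: fixed effects with an interaction term, a residual measurement-noise term, and a hierarchy via group-level random intercepts:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "group": rng.integers(0, 10, size=n),
})
group_effect = rng.normal(size=10)[df["group"]]   # hierarchy: group-level shifts
df["y"] = df["x1"] - 0.5 * df["x2"] + 0.8 * df["x1"] * df["x2"] + group_effect + rng.normal(scale=0.5, size=n)

# Fixed effects x1 * x2 (main effects plus their interaction), a random intercept
# per group, and the leftover variance plays the role of measurement noise.
model = smf.mixedlm("y ~ x1 * x2", df, groups=df["group"]).fit()
print(model.summary())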
This is going to be more misleading than informative, since the feature importance hierarchy can change drastically based on what other features are included.
> "For any single prediction from a model, how did each feature in the data affect that particular prediction"
Ok, but I don't see how this will be comprehended as anything other than a black box for the typical ML model.
> "What interactions between features have the biggest effects on a model’s predictions"
If there are still big "interaction" effects that means you should have done better feature generation/selection.
This sentence describes a disturbingly large proportion of social science papers. After controlling for X, Y, Z (naturally!), we find that A is the leading cause of B.
A strong example is a classifier that distinguishes between wolves and dogs. When you look at which pixels were the most impactful in 'Wolf' predictions, you can see that it is actually pixels of snow that are leading to predictions of 'Wolf'.
It's still a bit of a black box, but now it's a black box with an obvious, measurable flaw that you can work to address.
Yes, but how common is a situation like this? I mean your classifier is then not much better than making a histogram of pixel intensities.
Another "compromise" is using variable importance outputs from tree-based models (e.g. xgboost), which IMO is a crutch; it says which variables are important, but not whether it's a positive or negative impact.
In my experience this is largely illusory. People think they understand what the model is saying but forget that everything is based on assuming the model is a correct description of reality.
Your typical case of linear regression isn't even close to a correct description of reality, what gets included is largely arbitrary and due to convenience. As a result, different people with different types of data can get very different estimates for any features common to both models.
Also, stuff like this:
"I don't even think simple linear models are actually explainable. They just seem to be. Eg, try this in R:
treatment = c(rep(1, 4), rep(0, 4))        # treatment indicator
gender1 = rep(c(1, 0), 4)                  # gender coded with one level as 1
gender2 = rep(c(0, 1), 4)                  # the same gender variable, coding flipped
result = rnorm(8)                          # random outcome, no real effects
summary(lm(result ~ treatment*gender1))    # same data, different-looking coefficients
summary(lm(result ~ treatment*gender2))
AFAICT, that is also the appropriate role for simpler methods like linear regression, unless you have reason to believe your model is robust to any missing "unknown" features and any included "spurious" ones. In that case you are probably dealing with a model based on domain knowledge and not even running a standard linear regression.
People (over-)interpret the coefficients all the time though; I actually think this is a huge problem.
*I refer to this sense of triangulation: https://en.wikipedia.org/wiki/Triangulation_(social_science)
:subject :predicate :object .
:alice :loves :bob .
To support this kind of information, each statement is annotated with metadata like document of origin, author, time and place.
:s1 rdf:type rdf:Statement ;          # reified statement ":alice :loves :bob"
    rdf:subject :alice ;
    rdf:predicate :loves ;
    rdf:object :bob .
:a1 rdf:type prov:Activity ;          # the activity that asserted the statement
    prov:wasAssociatedWith :alice ;
    prov:hadMember :s1 .
:alice rdf:type prov:Agent .
Again, data is toxic. Even if you think you're doing something safe, there's a chance you're not.
But this is exactly what we do in DL.
Yet this is exactly what DL techniques seem to be doing. I am amazed that it works so well.
I am not an expert. Can someone ELI5?
The two fixes are to either exclude the captioned examples from the set or to add in a filter "this is caption text - ignore it".