Interpretability in Machine Learning: An Overview (thegradient.pub)
182 points by atg_abhishek on Nov 30, 2020 | 46 comments

This is a good exposition of some formal definitions for 'interpretability' in the context of machine learning, but I am still not really clear on why such a property is necessary or even desirable in the context of high dimensional statistical learning algorithms. In some sense the power of modern machine learning (as opposed to a set of heuristics + feature engineering + a linear classifier) is that it is not limited by what its designers are able to imagine or understand. If it were possible to give a simple explanation of how a high dimensional classifier works then it would also likely be unnecessary to have so many parameters.

As an example, if we consider natural language processing, then we might say that we want our NLP algorithm to be interpretable. This is clearly a tall order since the study of linguistics is still full of unsolved riddles. It seems silly to insist that a computational model of language must be significantly easier to understand than language itself. If interpretability is not feasible with language - a construct that is intimately connected to the faculties of the human brain - then why should we expect it to be feasible (or desirable) for the wide range of applications that do not come naturally to people?

As a counterargument: it's precisely because of high-dimensional statistical learning that interpretability is a valuable trait. Yes, the power of modern ML is that it can handle situations that the designers did not explicitly design for--but this doesn't necessarily mean that it handles them well. For example, if your approval for a loan is subject to an AI and it rejected you, then you want to know why you were not approved. You'd want the reason your application was not granted to be something reasonable (like a poor credit history) and not something like "the particular combination of inputs triggered some weird path and rejected you offhand." Another example is machine vision for self-driving cars. You want the car to understand what a stop sign is and not just react to the color, otherwise the first pedestrian with a red jacket will bring the car to a screeching halt. Even though you may not have had red jackets in your training set (or may not have had enough so that misclassifications ended up contributing to your error percentage), you can verify the model works as intended using interpretability.

It's dangerous to treat these sorts of models as black boxes, as the details of how the model makes a decision are as important as the output; otherwise, how could it be trusted?


This topic is the subject of my thesis, so I am currently steeped in it. Let me know if I can answer any more questions!

While I accept your counterargument (esp. regarding credit scores, autonomous vehicles and the like) I think interpretability in deep models is only meaningful if you have accurate, objective labels. But how often is this the case? For example, when people refer to concepts, they often rely on some established culture, a common understanding, rather than things you can measure. Let's take the simple example of classifying spoiled fruit at the grocery store. You can train a ConvNet and it will probably learn to recognize some visual traits of "spoiledness", but how objective can it really be, given that humans don't always agree what "spoiled" really means? In other words, fruit is spoiled only when a large enough social group says it is spoiled. So if that is your label, then the model can only reflect the shared understanding of "spoiledness" in the given social group. Then interpretability can help you check if the model is looking at plausible things (e.g. the fruit, not the background). However, this won't really tell you "why" the model thinks this fruit is spoiled. You can draw an interesting analogy between this and ensemble models, where the "social group" of the models in the ensemble forms the shared culture.

Also, an interesting interpretability approach that the article did not mention is SHAP[0].

[0] https://github.com/slundberg/shap

Regarding accurate labels, a lack of appropriate or sufficient labeling will knock over your model regardless of how powerful your model is (interpretable or not). This is where other benefits of model interpretation come in—you can spot potential errors in your model's training, and that gives you an indication that you need to re-evaluate your base assumptions about the data.

SHAP is cool technology! It looks like it builds off LIME and similar to fit a hyperplane against the model surface. I'm not surprised the article didn't mention it though, as it's a bit in the weeds for an overview piece.
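As a rough illustration of the "fit a hyperplane against the model surface" idea, here is a minimal LIME-style sketch. It is not SHAP's algorithm or API: `black_box`, `local_linear_explanation`, the grid of perturbations and the kernel width are all assumptions made up for the example. The perturbations are laid out on a symmetric grid, which lets the proximity-weighted least-squares slopes be computed one feature at a time.

```python
import itertools
import math


def black_box(x):
    # Hypothetical opaque model; the x[0] * x[1] term makes it nonlinear.
    return x[0] * x[1] + 3.0 * x[2]


def local_linear_explanation(model, point, step=0.1, kernel_width=0.25):
    """Fit a proximity-weighted linear surrogate around `point`.

    Perturbations form a symmetric grid, so the weighted least-squares
    slopes decouple and each one can be computed feature by feature.
    """
    offsets = (-step, 0.0, step)
    numerators = [0.0] * len(point)
    denominators = [0.0] * len(point)
    for delta in itertools.product(offsets, repeat=len(point)):
        # Closer perturbations get more weight (Gaussian proximity kernel).
        weight = math.exp(-sum(d * d for d in delta) / kernel_width ** 2)
        prediction = model([p + d for p, d in zip(point, delta)])
        for j, d in enumerate(delta):
            numerators[j] += weight * d * prediction
            denominators[j] += weight * d * d
    return [n / d for n, d in zip(numerators, denominators)]


slopes = local_linear_explanation(black_box, [2.0, 5.0, 1.0])
print([round(s, 3) for s in slopes])
```

Around the point (2, 5, 1) the slopes come out close to 5, 2 and 3, i.e. the local hyperplane says "feature 0 matters most here", even though the underlying model is not linear. The answer changes if you explain a different point, which is the whole point of a local surrogate.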

You could say the same thing about a person’s reasoning. If someone were to think a fruit is spoiled, does that say more about the fruit or about their understanding of spoiledness as you put it?

That argument falls apart as you leave the simplest elements of a problem. Sure, stop signs are important but that’s such a basic aspect of driving as to basically be irrelevant. All self driving AIs are going to get really good at identifying stop signs very early in development; it’s stuff like mirages that designers may never consider that make or break such systems.

The easiest elements for humans to interpret are therefore the least important as they’re going to have plentiful training data and people checking for failures. Interpretability is therefore only really useful for toy problems that don’t actually need machine learning.

So, the ethics argument for ML explainability may not be particularly strong (although, once the regulators arrive that isn't going to matter), but the practical argument is extremely strong.

Any explanation method will give you more insight into how your model, features and data interact. This will allow you to improve the model, and also avoid insane features that are just capturing either information from the future or are not worth the processing effort.

Maybe it's that I came to data science from a theory-driven science point, but I don't really understand why you wouldn't want to be able to interpret your model.

> Maybe it's that I came to data science from a theory-driven science point, but I don't really understand why you wouldn't want to be able to interpret your model.

Because it limits your model to doing things you could understand, which thus makes it less powerful. In other words it’s a useful property only so far as you get it for free. A self driving car AI being understandable isn’t worth it killing more people in the field.

There's no proof that uninterpretable models perform better, it's just a recent observation.

In some sense, it's possible to interpret any model; the effort required just varies.

On your self driving car example, one whose decisions you can explain to regulators is much more likely to be approved than one whose decisions you can't.

Given the number of models I've built, I'm not sure there is a scarier thing than an uninterpretable model moving a ton of metal around at high speed.

> There's no proof that uninterpretable models perform better

It’s a vastly larger solution space. So it’s really the reverse that would be surprising.

> In some sense, it's possible to interpret any model; the effort required just varies.

Models can be of arbitrarily large sizes to the point where people really can’t understand them. How do you go about dissecting a 2 layer NN with 10^40 nodes?

> Models can be of arbitrarily large sizes to the point where people really can’t understand them. How do you go about dissecting a 2 layer NN with 10^40 nodes?

I'd probably take the output of the first layer, and cluster it.

It would very much depend on what this NN was attempting to do.
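A minimal sketch of what "take the output of the first layer, and cluster it" could look like, using only the standard library. Everything here is hypothetical: the first-layer weights `WEIGHTS`, the toy inputs, and the tiny k-means helper stand in for a real network and a real clustering library.

```python
def relu_layer(x, weights):
    # First-layer activations: ReLU(W x), with hypothetical fixed weights.
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]


def squared_distance(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


def kmeans(points, centroids, iterations=10):
    """Plain k-means; returns one cluster label per point."""
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [
            min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # keep the old centroid if a cluster goes empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels


WEIGHTS = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # made-up first layer
inputs = [[0.1, 0.0], [0.0, 0.2], [0.2, 0.1], [3.0, 3.1], [2.9, 3.0], [3.1, 2.8]]

activations = [relu_layer(x, WEIGHTS) for x in inputs]
labels = kmeans(activations, centroids=[activations[0], activations[-1]])
print(labels)
```

On this toy data the activations split into two groups, `[0, 0, 0, 1, 1, 1]`. With a real network you would then look at which inputs land in each cluster to guess what the layer has learned to separate.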

Like, all of the work in this field does suggest that people value this feature, and it's incredibly useful for debugging, which is generally pretty hard in ML/statistics.

> It’s a vastly larger solution space. So it’s really the reverse that would be surprising.

Imagine a world in which linear models returned the entire matrix by observation rather than the coefficients. People would argue that it was uninterpretable, but it's a problem of tools.

I actually think that if you can instrument a model appropriately, then you can definitely build an interpretation layer on top of it. Clearly that doesn't make the model perform worse.

Even if you can't instrument it, you can run thousands of experiments changing one feature at a time and then estimate the impact of this feature on the model. Granted, that's not practical on many problems, but neither were deep NN's a decade ago.
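The "change one feature at a time and estimate the impact" experiment can be sketched in a few lines. This is only an illustration under assumptions: `black_box` is a made-up stand-in for a model you cannot instrument, and the perturbation size `delta` and the three data rows are arbitrary.

```python
def black_box(row):
    # Hypothetical opaque model; feature 1 is deliberately ignored.
    return 4.0 * row[0] + row[2] ** 2


def one_at_a_time_impact(model, rows, delta=0.1):
    """Mean absolute change in output when each feature alone is nudged."""
    n_features = len(rows[0])
    impacts = []
    for j in range(n_features):
        total = 0.0
        for row in rows:
            nudged = list(row)
            nudged[j] += delta
            total += abs(model(nudged) - model(row))
        impacts.append(total / len(rows))
    return impacts


rows = [[1.0, 2.0, 0.5], [0.0, 1.0, 1.0], [2.0, 0.0, -0.5]]
impact = one_at_a_time_impact(black_box, rows)
print(impact)
```

Feature 1 gets an impact of exactly zero, which is how you would spot a feature that is not worth its processing cost. The obvious limitation is that nudging one feature at a time says nothing about interactions between features.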

Do you have any material you can recommend as an introduction to interpretation of NN systems, or how to design NN-based ML to be good for interpretation?

My background is more classic ML and computer vision (features design, geometry, photogrammetry) and I've been very hesitant to use NN in a number of situations precisely because of things like auditability and being able to fix bugs in critical-for-customer edge cases.

R. Guidotti et al.[0] wrote a good literature survey on black-box explainers, which contains a summary table on page 20 of the current state of the art.

In terms of designing NN-based ML, the above paper has some info and this paper by S. Teso[1] is a good place to start looking further (though it is focused on XAL). SENNs are cool, but ultimately most inherently interpretable models come down to classic ML (decision trees, linear/logreg, etc.), which is limiting compared to the power of NNs. Post-hoc explanations are basically the only option (esp. for DNNs).

[0] http://arxiv.org/abs/1802.01933

[1] https://www.semanticscholar.org/paper/Toward-Faithful-Explan...

The loan example comes up a lot but I'm not sure why. We already use machine learning to evaluate credit ratings in the US. It is not always a fair system but it is standard practice and nobody asks why they are turned down for a loan or demands to know how the system works (this information is proprietary and banks would claim their algorithms are a trade secret since a better algorithm gives them an edge on pricing loans).

>>You'd want the reason your application was not granted to be something reasonable (like a poor credit history) and not something like "the particular combination of inputs triggered some weird path and rejected you offhand."

This is a low dimensional bias. If there is increased risk of default from high A & B & C & D but not high A or high B or high C or high D, then the combination of parameters is what matters even if it is not easy to explain. Typically in a high dimensional space most of the volume is far from the axes, so it is unlikely that things will line up along some preconceived set of inputs. As it is, 'poor credit history' is in fact an index that amalgamates a large number of different parameters, so I'm not sure if that really explains why the loan was rejected or simply gives a simple name for a complicated thing.

In general yes, it is good to thoroughly debug any ML algorithm and make sure that it is doing roughly what you think that it is doing. A lot of times this process can be quite complicated and relies on a lot of intuition & heuristics. While thoroughly testing an ML solution is certainly best practice, I'm not sure if having a highly skilled researcher conduct an in-depth mathematical analysis of an algorithm would really make it 'interpretable'.

I had experience developing automated underwriting models in the insurance industry. As the models are becoming more sophisticated and adapting machine learning, heavier scrutiny is coming from regulators.

For good reasons, guardrails are required to not only protect against discrimination, but also so proxies can't be used either. For example, we can't include race in health and life underwriting decisions. But since zip code is highly predictive of race, that attribute must also be excluded.

I'm not familiar with banking regulations, but imagine similar policies are applied. In these cases, being able to demonstrate that a model isn't discriminating is not only ethically important, but in many cases is legally required.
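One simple way to check whether an attribute such as zip code acts as a proxy is to ask how well it predicts the protected attribute on its own. The sketch below is an assumption-laden illustration, not a regulatory test: `proxy_strength` is a hypothetical helper and the records are entirely synthetic.

```python
from collections import Counter, defaultdict


def proxy_strength(feature_values, protected_values):
    """How well does one feature alone predict a protected attribute?

    Returns (accuracy of guessing the majority group within each feature
    value, accuracy of always guessing the overall majority group).
    """
    by_value = defaultdict(Counter)
    for value, group in zip(feature_values, protected_values):
        by_value[value][group] += 1
    n = len(feature_values)
    per_value_hits = sum(c.most_common(1)[0][1] for c in by_value.values())
    baseline_hits = Counter(protected_values).most_common(1)[0][1]
    return per_value_hits / n, baseline_hits / n


# Entirely synthetic records: (zip code, protected group).
zip_codes = ["111", "111", "111", "111", "222", "222", "222", "222"]
groups = ["A", "A", "A", "B", "B", "B", "B", "A"]

accuracy, baseline = proxy_strength(zip_codes, groups)
print(accuracy, baseline)
```

Here the zip code alone recovers the group 75% of the time against a 50% baseline, which would flag it as a likely proxy. In practice you would run this on held-out data and on combinations of features, since several weak proxies can add up to a strong one.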

It's more than discrimination in banking, you need to show that there are no side channels where insider information might be fed in.

This comment is very, very true.

While you as a person may not be able to get details of the algorithms and methods used, the regulators get exhaustive documentation about every aspect of them.

With GDPR, explainability is going to become a requirement, but it will probably take some test cases before this happens.

I’m pretty sure that despite the fact that statistical models are used in credit scores, those models are subject to regulatory review. For example, I’m pretty sure a “black, therefore, no loan” algorithm is illegal, regardless of whether that parameter was learned or programmed. It’s a huge problem in some of these models that latent racism can bubble up into a putatively unbiased algorithm. In some industries, including finance, I think that can expose you to some regulatory risk, hence explainability being quite helpful.

> This is a low dimensional bias. If there is increased risk of default from high A & B & C & D but not high A or high B or high C or high D, then the combination of parameters is what matters even if it is not easy to explain.

However, even if we expect that a non-obvious combination of parameters will matter, we usually expect the hyperplane of our predictions to be at least a little bit smooth in various ways: monotonic or curved instead of jagged, small changes in input should cause only small changes in output, etc. Not just to make it easier to understand, also because the kinds of processes we study tend to behave that way.

For regions of high density, machine learning does exactly what you say it does: generate high-quality predictions or categorizations, even if the particular path that led it there is nonobvious or weird. But these models are generally not sensitive to how they categorize or predict unusual combinations of inputs and as to predictive quality for those edge cases, all bets are off. A very simple case is polynomial regression, which can be tuned to perfectly fit the training data but outside of the training set might oscillate wildly or go to infinity -- and this isn't really the result of overfitting, it's just what polynomials do.
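The polynomial case is easy to reproduce. The sketch below uses Lagrange interpolation as a stand-in for "polynomial regression tuned to perfectly fit the training data" (with n points, the degree n-1 least-squares fit is the same polynomial); the training targets are made up and all lie in [0, 1].

```python
def lagrange_fit(xs, ys):
    """Return the unique degree-(n-1) polynomial through all n points."""
    def polynomial(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return polynomial


train_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
train_y = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]  # made-up targets in [0, 1]

model = lagrange_fit(train_x, train_y)
train_error = max(abs(model(x) - y) for x, y in zip(train_x, train_y))
just_outside = model(8.0)
print(train_error, just_outside)
```

The training error is zero, yet one step past the last training point the prediction is 128 for a target that never left [0, 1]. Nothing in a train/validation score computed inside the training range would warn you about that.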

The loan example comes up because of a long history of explicit discrimination in the US against Black borrowers (https://en.wikipedia.org/wiki/Redlining). Because of that history, people are justifiably skeptical when someone says: "I know that this industry discriminated against you in the past, but trust me, this new and completely opaque system is TOTALLY not going to discriminate against you anymore." Especially when the new systems have demonstrable racial bias (https://www.realtor.com/news/trends/black-communities-higher... among many many other sources).

What if an interpretable model is worse at telling stop signs from jackets than an uninterpretable model? Should we use the worse model because we value interpretability?

This is the type of hypothetical that kills the discussion, though.

If the model is interpretable, you have a high chance of knowing why it does or does not tell a stop sign from a jacket. If it is not, you only know that in your test/validation set, it can do the job.

Even tasks that machine learning clearly excels at are currently in a state where all good uses of it have a human supervisor at some level. Recognizing faces, as an example. For my personal library, I absolutely have to disambiguate the recognized faces of my kids as they get older in all of the products I've used.

If we value interpretability for the particular model, e.g. as in the loan example or where by law you have to make sure race was not a consideration, I'd say yes. In places where interpretability has no additional value, then of course no.

But it of course depends on the exact value trade-off, which any model designer already has to consider.

Yes, because in the interpretable model, the fix can also let you check robustness against flags and mailboxes at the same time. Not everything is correctly captured in the test dataset, so we need levels of abstraction that let us be more general.

You can use an uninterpretable model in conjunction with a post-hoc explainer—and in fact, this is most often how explainers are used. This gives you the best of both worlds: powerful models and auditability for their decisions.

It is common in my experience to catch algorithms "cheating" by using interpretability methods. Interpretability methods are useful tools for debugging models that appear to be performing well but may in fact be using an irrelevant bias in your dataset that will not generalize.

> This is a good exposition of some formal definitions for 'interpretability' in the context of machine learning, but I am still not really clear on why such a property is necessary or even desirable in the context of high dimensional statistical learning algorithms.

Because the models can fail, and we want to know how to prevent them from failing further. In a pure black box model, we know about the test and validation runs, and not much else. When Google thus deploys a model that classifies accounts as toxic or not and then cancels the toxic ones, regardless of how many domains you manage or YouTube followers you have or even whether you have a YouTube TV subscription, you'd prefer knowing why the model chose to give you the axe. You might even prefer a "human in the loop" when the system makes a call but doesn't really have confidence.

For certain areas like NLP, sure, it'd be tough. But for CV tasks or many other ML tasks, some form of explanation would be invaluable and much more (human) user-friendly.

> I am still not really clear on why such a property is necessary or even desirable in the context of high dimensional statistical learning algorithms

Law, regulations, and human trust.

In my experience people will come at this from a variety of viewpoints. Typically they (1) don't trust the model to learn something useful, so they want some confidence that the model isn't going to do exotic things with new inputs (i.e. they want some faith in the generalization ability of the model to unseen inputs) or (2) they want the model to help them understand the problem. Your language example is perfect. How nice would it be for linguistics if a complex model could tell you about simple structures that you didn't previously know about. It is nearly an article of faith with some people that these simple structures must be there generating the statistics we see, like they haven't considered the possibility that there might not be a simple structure underneath.

Many industries such as insurance have legal requirements that prevent the use of many black box methods.

Scientists using ML for research often wish to understand their subjects, and interpretable ML would probably be more likely than non-interpretable ML to help improve understanding.

“hey this dnn said this image has a faulty sensor in it. why? is it because it’s correctly spotting the fault, or is it that random cluster of irrelevant 12 pixels over there?”
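One common way to answer that kind of question is occlusion: blank out one patch of the image at a time and see how much the score drops. The sketch below is a toy under stated assumptions: `shortcut_model` is a made-up scorer that secretly looks at only two corner pixels, and the 4x4 "image" is synthetic.

```python
def shortcut_model(image):
    # Hypothetical classifier score that secretly depends only on two
    # pixels in the top-left corner, mimicking a spurious shortcut.
    return image[0][0] + image[0][1]


def occlusion_map(score, image, patch=2, fill=0.0):
    """Score drop when each patch x patch block is blanked out."""
    baseline = score(image)
    drops = {}
    for top in range(0, len(image), patch):
        for left in range(0, len(image[0]), patch):
            occluded = [row[:] for row in image]
            for r in range(top, min(top + patch, len(image))):
                for c in range(left, min(left + patch, len(image[0]))):
                    occluded[r][c] = fill
            drops[(top, left)] = baseline - score(occluded)
    return drops


image = [[1.0] * 4 for _ in range(4)]  # toy 4x4 "image"
drops = occlusion_map(shortcut_model, image)
print(drops)
```

Only the top-left patch changes the score, so if the fault you care about sits elsewhere in the image, the map tells you the model is looking at the wrong pixels.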

For the nlp example, I think the goal would be more of a reflective model. That is, not so much one that we can interpret by inspection of the state, but at least one that can expand on its state in the form of "why?"

This has actually been my biggest complaint against the smart speaker craze. Often I just want to ask, what did you think I said? Or, why did you activate? To some extent, the partner app allowed this. It is very limited, though.

One reason to want to understand what some "black box" is doing is distribution/dataset shift, especially in medical applications.

For example, suppose you're building a neural net to detect early-stage lung cancer on medical imaging, and you test/train it on patients in a small set of hospitals. Often the hospital name is given on the image, and this can be used as a covariate to improve accuracy (due to the hospitals serving different populations with different demographics). But a model that does this may suffer when put into production at other hospitals.

Some real-life examples with this flavor are given at the end of these slides: https://mlhcmit.github.io/slides/lecture10.pdf.

See also Section 6.3 of this paper for how interpretability can help choose models with superior generalization: https://arxiv.org/abs/1602.04938.

But. Don't you wanna learn? I love implementing interpretability papers on my CNNs. It's very revealing. I also like just toying with Monte Carlo methods or nearest neighbors to see the origin of a decision or 'how close' it was to deciding otherwise in each dimension. If only to detect some unknowns in your training set.
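The "how close was it to deciding otherwise in each dimension" check can be sketched as a per-feature grid search for the smallest change that flips the decision. This is one possible reading of the idea, not the commenter's method: `approve` is a hypothetical decision rule, and the step size and search range are arbitrary.

```python
def approve(x):
    # Hypothetical black-box decision rule standing in for a trained model.
    return 2.0 * x[0] - x[1] - 1.0 > 0.0


def smallest_flip_per_dimension(classify, point, step=0.125, max_steps=32):
    """For each dimension, the smallest single-feature change (searched on
    a grid, alternating + and -) that flips the decision, or None."""
    original = classify(point)
    flips = []
    for j in range(len(point)):
        found = None
        for k in range(1, max_steps + 1):
            for change in (k * step, -k * step):
                candidate = list(point)
                candidate[j] += change
                if classify(candidate) != original:
                    found = change
                    break
            if found is not None:
                break
        flips.append(found)
    return flips


flips = smallest_flip_per_dimension(approve, [1.0, 0.5])
print(flips)
```

For the point (1.0, 0.5) this prints `[-0.25, 0.5]`: lowering feature 0 by 0.25 or raising feature 1 by 0.5 is enough to change the outcome. A `None` entry would mean no flip was found within the search range for that feature.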

Apart from the usual regulatory angle (zip code in US often proxies for race), generally the need for interpretability goes with risk and frequency of decisions being made.

The risk and decision frequency profile of "tag photos with names in social media" is different vs. "decide which contract worth tens of millions of dollars is better to strike".

Both require some form of ML/Statistical inference but the higher the risk, the more the explainability required by the decision maker.

One strategy you can adopt is to break up a big decision into lots of smaller decisions (e.g. buy ads individually on auction vs. bulk publisher deals) but that kind of approach often comes with its own costs (infra to handle scale, transaction costs, lost negotiation leverage). In any real business investment scenario, you usually end up with many many decisions that are somewhat higher risk and they require explainability.

>in the context of machine learning, but I am still not really clear on why such a property is necessary or even desirable in the context of high dimensional statistical learning algorithms.

Depending on the context of where the Algo is used interpretability may in fact be necessary. (Try getting Deep Learning Models deployed in a Consumer Banking company.) Generally outside of hardcore tech companies (FAANG, etc.) you are usually building things in conjunction with Business partners within your org and good luck explaining to them that you want to deploy a completely opaque algorithm that will somehow solve their problems.

Completely agree.

Some of the tools for interpretability can be useful (particularly for debugging), but I think the broader idea that we always need to be able to understand our own models is basically wrong.

For example, if you want AlphaGo to explain why it made a particular Go move, what kind of an explanation is possible? In many cases the only explanation may be that the move leads to a higher probability of a win. There simply may not be a more “compressed” or “high level” answer. Even human Go players often cannot explain why they choose particular moves, other than references to shape and feel, which is basically another way of saying their evaluation of the move leads to a higher win probability. There are a lot of domains where we may just have to accept that that _is_ the explanation.

To zoom out a bit, our greatest discoveries have historically been about finding the rare places where the universe is computationally compressible. Boiling a kettle is almost unimaginably complex to describe in terms of the interaction of elementary particles. But you can make very good predictions about that process using an equation that fits on a cocktail napkin. There may be other areas in which the universe is compressible only to a lesser extent. The parameters of AlphaGo are vanishingly small compared to the size of the Go game tree, but are very large compared to the equation we can use to predict the kettle. There may be many problems where the best descriptions lie in this intermediate domain, a domain which we have never really had access to before (except via biological brains).

So if learned models give us access to some truths without access to their (human intelligible) explanations, I think we need to just embrace that. If you allow yourself a new way of seeing, you can see new things.

> So if learned models give us access to some truths without access to their (human intelligible) explanations, I think we need to just embrace that.

The question is: if the model is not interpretable or understandable, how do you know that what it gives you is, in fact, truth?

You basically need some kind of external validation of the results. In the case of Go, the rules of the game and competition basically provide that in a very authoritative way. I don't think this is the case for all that many domains.

As I see it, behind the desire for interpretability there are two main concerns:

* could a model that gives really good answers in all the cases we have tested still give catastrophically wrong answers in some cases we have not foreseen?

* could the model be relying on some flaw or bias in the training data which we haven't realized?

It might be desirable when the network/algorithm doesn't work as expected and throwing more data at the problem is not possible (say, you've exhausted all available data).

Could you give an example of an unsolved riddle from linguistics?

I agree that this is a nice piece, but I still think it's kind of fuzzy, and maybe not formal enough, and this might be related to your question.

To back up a bit, let's say you have some device (algorithm, black box, meta-DL model) that translates ML models into human-comprehensible language meeting some interpretability criterion.

Let's say that some ML model fails according to this device, that the device says "this is not translatable."

There are different possible reasons for this, but one might be that the ML model is itself at some level unlearnable, in the sense of being incompressible or unmodelable in itself, even by another black box machine. We can say that the model might meet some cross-validation criteria, etc., but if it's unexplainable in a meta-modeling sense, by anything, it implies maybe that the ML model is specious, that it's meeting some superficial criteria but not really doing what it's supposed to.

This "meta-modeling" is part of the process by which new ML models are developed by the way, at some level. It's often implicit, but we assume we understand something -- that is, there's some level of interpretability -- by virtue of the fact we can say "such and such type of DL structure is better for this type of domain" etc.

Of course, if the translating device says an ML model is untranslatable, it could just be that humans can't understand it, but the danger is that we don't really know at this point how to distinguish that from the case where the ML model is specious. It's a sort of meta-verification problem.

I also think that humans have some deeper understanding that isn't formalized yet into decision criteria regarding what constitutes a successful ML model. That is, we have some intuitive understanding of causality, and the idea of some things being causally "closer" to what we are interested in, in a vague sense. So when, e.g., photos of Obama are being classified based on silhouette position, we understand that there will be misclassification under a different set of stimulus conditions that are more comprehensive than what the model was trained and tested on. In that case, interpretability is tied, again, to speciousness and a failure of the model development process.

Incidentally, these arguments about interpretability parallel very closely debates in the psychological measurement literature in the 60s and 70s, about measures being selected based on their empirical performance ("empirically keyed" measures) versus other characteristics (e.g., internal structural considerations or theoretical interpretability criteria). There, there were similar arguments, in that some would say "it doesn't matter if the measure makes sense, the predictive performance matters". There were subsequent meta-analytic evaluations of how different approaches fared, and it turned out not to matter empirically in the long run. The reasons for this are difficult to explain in a small space, but one way to think about it is that when the empirically keyed measures were considered in a broader context than what they were developed on, they started to have limitations that were not initially considered (e.g., what happens when you have multiple empirically keyed measures simultaneously?). I think eventually the theoretically-based and internal-structural based approaches gained ground because they were easier to develop -- that is, you could improve them more and imbue them more easily with the types of characteristics you wanted them to have.

I think there's a lot to be learned from that debate in psychology. E.g., if you can't interpret an ML model, how do you achieve a set of goals in model development? What happens when constraints start to be introduced? You could approach this blindly but it seems you need some priors at least, which I think are kinda what human interpretability provides.

Maybe some day AI will be so well-developed that humans won't matter at all (e.g., it suffices to have some ML-to-AI translation device rather than an ML-to-human translation device), but I think we're far from that point.

A good eBook on the subject that the author continually updates: Interpretable Machine Learning, A Guide for Making Black Box Models Explainable[0].

This book helped me implement Accumulated Local Effects in Python which we used to explain a timeseries model.

[0] https://christophm.github.io/interpretable-ml-book/
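For readers who have not met Accumulated Local Effects, here is a minimal first-order ALE sketch for a single feature. It is not the book's code or the implementation mentioned above: `model`, the fixed bin `edges` and the synthetic rows are assumptions, and real implementations usually pick the bins from quantiles of the data.

```python
def model(x):
    # Hypothetical fitted model; any callable on a feature list works.
    return 3.0 * x[0] + x[1]


def ale_first_order(predict, rows, feature, edges):
    """Centered first-order ALE values at each bin edge for one feature."""
    n_bins = len(edges) - 1
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for row in rows:
        # Find the bin holding this row; the last bin includes its right edge.
        k = 0
        while k < n_bins - 1 and row[feature] >= edges[k + 1]:
            k += 1
        low, high = list(row), list(row)
        low[feature], high[feature] = edges[k], edges[k + 1]
        sums[k] += predict(high) - predict(low)
        counts[k] += 1
    # Accumulate the average local effect bin by bin.
    accumulated = [0.0]
    for total, count in zip(sums, counts):
        accumulated.append(accumulated[-1] + (total / count if count else 0.0))
    # Center so the count-weighted average effect over the data is zero.
    mean = sum(
        count * (accumulated[k] + accumulated[k + 1]) / 2.0
        for k, count in enumerate(counts)
    ) / len(rows)
    return [value - mean for value in accumulated]


# Synthetic data where the two features are perfectly correlated.
rows = [[float(v), float(v)] for v in range(6)]
ale = ale_first_order(model, rows, feature=0, edges=[0.0, 2.0, 4.0, 6.0])
print(ale)
```

The result is `[-9.0, -3.0, 3.0, 9.0]`, a slope of 3 per unit of feature 0, which matches the model's own coefficient even though the second feature moves in lockstep with the first. That robustness to correlated features is the usual reason to reach for ALE over a plain partial dependence plot.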

This looks like a very good resource. Thanks for sharing.

Hi HN, I'm an editor at The Gradient. Coincidentally we were doing some server maintenance this evening so there may have been some downtime half an hour ago.

Apologies if anyone couldn't access the piece. Everything should be back up now.

Thanks everyone for contributing to the discussion.

There is a method that tries to classify images by decomposing them into components that are interpretable without too much loss in classification performance.

https://papers.nips.cc/paper/2019/hash/dca5672ff3444c7e997aa... https://github.com/saralajew/cbc_networks

I think one relevant difference is between discrete/constrained/physical targets/results (a new material, a medical image classifier, etc.) and continuous/unconstrained/incremental targets/results (nlp at large, advertising, etc.)? The former would need an explanation to satisfy a peer-review, the latter are more than happy with a black box that just works and beats the competition?
