Implementing gradient explanations for a HuggingFace text classification model (victordibia.com)
49 points by vykthur on May 30, 2022 | 5 comments


A few notes:

1. Hugging Face models are supported by Captum, a framework for gradient-based explanations of any PyTorch model (minimal usage sketch at the end of this comment): https://captum.ai/tutorials/Bert_SQUAD_Interpret

2. There are several Hugging Face Spaces that showcase, in the browser, model explanations for Hugging Face models using a variety of techniques, such as LIME: https://huggingface.co/spaces/Hellisotherpeople/Interpretabl...

or with SHAP: https://huggingface.co/spaces/Hellisotherpeople/HF-SHAP

and there is definitely already an example of doing it with gradient-based techniques, but I'm having trouble finding it!

3. It's cool to see someone do this with from-scratch code, since gradient-based explanation techniques are quite involved and vary considerably from one technique to another.
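
For the curious, here is a minimal sketch of what using Captum with a Hugging Face PyTorch classifier looks like. The checkpoint name and the choice of LayerIntegratedGradients over the embedding layer are my own assumptions, not taken from the linked tutorial:

    # Minimal sketch: token attributions for a Hugging Face PyTorch classifier
    # via Captum's LayerIntegratedGradients. Checkpoint name is an assumption.
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification
    from captum.attr import LayerIntegratedGradients

    name = "distilbert-base-uncased-finetuned-sst-2-english"  # assumed checkpoint
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(name)
    model.eval()

    def forward_func(input_ids, attention_mask):
        return model(input_ids, attention_mask=attention_mask).logits

    enc = tokenizer("This movie was surprisingly good", return_tensors="pt")
    target = int(model(**enc).logits.argmax(dim=-1))

    # Attribute the predicted class's logit to the embedding layer
    # (model.distilbert.embeddings is the DistilBERT attribute path).
    lig = LayerIntegratedGradients(forward_func, model.distilbert.embeddings)
    attrs = lig.attribute(
        inputs=enc["input_ids"],
        additional_forward_args=(enc["attention_mask"],),
        target=target,
    )
    scores = attrs.sum(dim=-1).squeeze(0)  # one attribution score per token
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    print(list(zip(tokens, scores.tolist())))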


Yes. Captum is a great library, and a few of my colleagues have used it with good results in the past. Like you mention, the few existing examples that demonstrate gradient-based explanation methods for Hugging Face models typically focus on PyTorch; the example here looks at things from the TensorFlow 2.x/Keras perspective. One thing to note is that model-agnostic SHAP can be resource-intensive to compute, especially compared to gradient methods, which need only a single forward and backward pass through the model per datapoint.
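
For reference, a rough sketch of that TensorFlow approach (my own version, not the article's exact code; the checkpoint and the embedding-table attribute path are assumptions, and attribute names vary across transformers versions):

    # Rough sketch: vanilla gradient saliency for a Hugging Face TF 2.x
    # classifier. Checkpoint and attribute paths are assumptions.
    import tensorflow as tf
    from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

    name = "distilbert-base-uncased-finetuned-sst-2-english"  # assumed checkpoint
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = TFAutoModelForSequenceClassification.from_pretrained(name)

    enc = tokenizer("This movie was surprisingly good", return_tensors="tf")
    # Word-embedding table; attribute name varies across transformers versions.
    embed_table = model.distilbert.embeddings.weight

    with tf.GradientTape() as tape:
        # Do the word-embedding lookup ourselves so gradients can flow to it;
        # the model adds position embeddings on top of inputs_embeds.
        inputs_embeds = tf.gather(embed_table, enc["input_ids"])
        tape.watch(inputs_embeds)
        logits = model(inputs_embeds=inputs_embeds,
                       attention_mask=enc["attention_mask"]).logits
        top = tf.argmax(logits, axis=-1)
        score = tf.gather(logits, top, axis=-1, batch_dims=1)

    grads = tape.gradient(score, inputs_embeds)    # (1, seq_len, hidden_dim)
    saliency = tf.norm(grads, axis=-1)[0]          # one score per token
    saliency = saliency / tf.reduce_max(saliency)  # normalize to [0, 1]
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].numpy())
    print(list(zip(tokens, saliency.numpy().round(3).tolist())))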


How would you characterize the degree of explainability of large language models (e.g., GPT-3)? Anything surprising or non-intuitive there?


I think a more general question is: how would you measure explainability when you see it? Is there some sort of metric for that?


One weak point I see: this tool only measures how much changing an individual input token's embedding would decrease the loss. A token might have large gradients for reasons related to how many times it appears in the training set and how consistent the training set is with the evaluation set, not just how much the prediction disagrees with the target label.

So it jointly measures data coverage, data consistency, and data-to-target fit. Just my intuition, might be wrong.
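
A toy sketch of that conflation (my own illustration, not from the article): for a logistic-regression-style model, the input gradient is w * (p - y), so a feature's gradient can be large purely because its learned weight is large, independent of anything about the current datapoint:

    # Toy example: input gradients mix the learned weights (shaped by the
    # training data) with the current prediction error (p - y).
    import torch

    w = torch.tensor([3.0, 0.1])  # weights a model might learn from training data
    x = torch.tensor([1.0, 1.0], requires_grad=True)
    y = torch.tensor(1.0)

    p = torch.sigmoid(w @ x)  # prediction
    loss = torch.nn.functional.binary_cross_entropy(p, y)
    loss.backward()

    # dloss/dx = (p - y) * w, so feature 0's gradient is ~30x feature 1's,
    # entirely because of w -- not because of anything about this datapoint.
    print(x.grad)  # approximately tensor([-0.1293, -0.0043])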



