A key challenge in machine learning is understanding how a black box model arrives at its predictions. Several interpretation techniques have been developed over the years to address this issue. Some are model-specific, while others, called model-agnostic, can be used with any machine learning model. Interpretation tools are useful for understanding ML models, but they must be designed carefully so that their outputs remain interpretable without imposing too much cognitive load on users. The interpretability needs of different user groups also vary: in some cases users only need to know what a prediction is based on, while in others it is more important to understand why the model selected a particular outcome.
This article explores the interpretation techniques that can be applied to machine learning classification models and the challenges of using them. It aims to give a comprehensive overview of existing interpretation techniques so that readers can select and apply them in the context of their own projects. Interpretability methods fall into two broad categories: 1) intrinsic interpretability methods, which are interpretable by construction because of their simpler structure, such as linear models and tree-based models; 2) post hoc interpretability methods, which are applied to explain predictions generated by black box models such as neural networks and ensembles, for example permutation feature importance, partial dependence plots, and LIME (Local Interpretable Model-agnostic Explanations). Intrinsic interpretation methods tend to focus on explaining the decision process that a machine learning algorithm follows for each data point.
They do this by comparing the weights the model assigns to the input features and showing which ones are most influential for the predicted outcomes. However, these weight-based readings can be misleading when the input features are correlated, because correlated inputs can produce spurious attributions. Model-agnostic interpretability methods, on the other hand, are more general and can be used with any classifier, including state-of-the-art models such as deep neural networks. They work by analyzing the predictions the model produces on permuted versions of the dataset, in order to find the set of features that best explains the original outcome and why that specific combination was chosen.
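To make the weight-based view concrete, here is a minimal sketch, assuming a scikit-learn setup and the library's built-in breast cancer dataset (both illustrative choices rather than part of any particular method), that fits a logistic regression and ranks features by coefficient magnitude:

    # Minimal sketch of intrinsic interpretation via feature weights.
    # Dataset, scaling, and model choices are illustrative assumptions.
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_breast_cancer(return_X_y=True, as_frame=True)

    # Standardize so the learned coefficients are comparable across features.
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X, y)

    # Rank features by weight magnitude: for this linear model, a larger
    # absolute coefficient means a stronger influence on the predicted class.
    coefs = model.named_steps["logisticregression"].coef_.ravel()
    for name, w in sorted(zip(X.columns, coefs), key=lambda t: -abs(t[1]))[:5]:
        print(f"{name:>25s}: {w:+.3f}")

Keep in mind the caveat above: if two features are strongly correlated, the weight can be split between them or attributed to the "wrong" one, so the ranking should not be read as causal.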
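Permutation feature importance is one common instance of this permutation-based idea. The sketch below, again an illustrative assumption built on scikit-learn and the same dataset rather than a reference implementation, shuffles each feature on held-out data and measures how much the model's accuracy drops:

    # Minimal sketch of model-agnostic permutation importance.
    # The random forest and n_repeats value are arbitrary illustrative choices.
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.inspection import permutation_importance
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True, as_frame=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Any classifier works here; the analysis only needs its predictions.
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(X_train, y_train)

    # Shuffle each feature column in turn and measure how much the held-out
    # score drops: large drops indicate features the model relies on.
    result = permutation_importance(clf, X_test, y_test,
                                    n_repeats=20, random_state=0)
    for i in result.importances_mean.argsort()[::-1][:5]:
        print(f"{X.columns[i]:>25s}: {result.importances_mean[i]:.3f} "
              f"+/- {result.importances_std[i]:.3f}")

Because only predictions are needed, the same call works unchanged for any fitted classifier, which is exactly what makes the approach model-agnostic.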
Typicality scores can be an effective approach to interpretability, as they highlight similarities between the concepts that appear in images and the classification labels, themselves treated as concepts, that the model assigns to them. ICE (individual conditional expectation) curves and permutation feature importance are other interpretability techniques that can be used to visualize the contribution of a given feature value to a particular prediction. To be useful, an explanation needs to be at least locally faithful: it should correspond to how the model behaves in the region where the interpretation is made. Achieving even this level of fidelity is challenging, including for simple models.
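As a rough sketch of how ICE curves can be produced in practice, the example below uses scikit-learn's PartialDependenceDisplay (available from version 1.0, with matplotlib) on an arbitrary gradient boosting classifier and a single, arbitrarily chosen feature of the same illustrative dataset used above; all of these choices are assumptions made for illustration only:

    # Minimal sketch of ICE curves for one feature.
    # Model, dataset, and feature are illustrative assumptions.
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.inspection import PartialDependenceDisplay

    X, y = load_breast_cancer(return_X_y=True, as_frame=True)
    clf = GradientBoostingClassifier(random_state=0).fit(X, y)

    # kind="both" draws one individual conditional expectation (ICE) curve per
    # sampled instance plus their average, the partial dependence curve.
    PartialDependenceDisplay.from_estimator(
        clf, X, features=["mean radius"], kind="both",
        subsample=100, random_state=0,
    )
    plt.show()

Each thin line traces how one instance's prediction changes as the chosen feature is varied while its other feature values are held fixed; the thicker line is the average over instances.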