Image segmentation is of great interest in medical imaging, e.g. in imaging of tumors (1, 2), the retina (3), the lung (4), and the heart (5). In cardiac imaging, segmentation is applied to partition acquired images into functionally meaningful regions, from which quantitative static and dynamic measures of diagnostic relevance are derived. These measures include myocardial mass, ventricular volumes, wall thickness, wall motion, and ejection fraction. State-of-the-art performance for automatic segmentation is achieved with artificial neural networks (6–8), and many researchers have demonstrated impressive performance on their test task and target data. However, neural networks also have limitations, mainly regarding generalization to new data and interpretability (9).
Limited generalization is particularly problematic because training data and real-world data rarely come from exactly the same distribution. Methods to deal with this so-called data set shift are the subject of ongoing research (10). Furthermore, hidden stratification might have an effect (11), the sampling of training data is usually biased in some way (12), and networks might learn shortcuts (13), using unintended features to boost performance on the training set. These problems are commonly addressed by using diverse data sources, extensive data augmentation, or sophisticated models (14). A general framework to evaluate, quantify, and boost generalization is missing.
Explainability and interpretability of neural networks are further active fields of research (9, 15). The goal of model interpretability is to understand how and why a model makes certain predictions. While local interpretability explains a specific prediction of the model for a defined input, global interpretability describes the general features that determine the model's predictions. Specifically for neural networks, a variety of methods have recently been developed to determine so-called attribution (16). Here, attribution means evaluating the contribution of input features (17), layers (18), or single neurons (19) to the prediction.
Sensitivity analysis was first proposed by Widrow et al. in the context of misclassification caused by weight perturbations due to noisy input and machine imprecision (20). Since then, the term sensitivity analysis has been overloaded with several related meanings. An entire book is devoted to sensitivity analysis in neural networks, dealing with sensitivity to parameter noise (21). Here, we define sensitivity analysis as the exploration of the effect of input transformations on model predictions. The approach most closely related to the one presented here uses algorithm sensitivity analysis for tissue image segmentation (22). That work shares the general idea but differs in several respects, such as automatic parameter search and its focus on computational performance (22).
In this work, we describe a straightforward method to interpret arbitrary segmentation models. This sensitivity analysis provides intuitive local interpretations by transforming an input image in a defined manner and inspecting the impact of that transformation on the model performance.
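The idea can be illustrated with a minimal sketch that uses only numpy and does not rely on the misas API. It assumes a toy stand-in for a segmentation model (a fixed intensity threshold), a brightness scaling as the input transformation, and the Dice score as the performance measure; all function names here are hypothetical and chosen for illustration only.

```python
import numpy as np

def dice(pred, truth):
    """Dice overlap between two masks (defined as 1.0 if both are empty)."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    total = pred.sum() + truth.sum()
    if total == 0:
        return 1.0
    return 2.0 * np.logical_and(pred, truth).sum() / total

def toy_predict(image):
    """Hypothetical stand-in for a trained model: a fixed threshold."""
    return image > 0.5

def brightness(image, factor):
    """Scale intensities and clip to [0, 1]; the true mask stays unchanged."""
    return np.clip(image * factor, 0.0, 1.0)

def sensitivity(image, truth, predict, transform, params):
    """Score the prediction on the transformed image for each parameter."""
    return {p: dice(predict(transform(image, p)), truth) for p in params}

# Synthetic input: a bright square (0.8) on a dark background (0.2).
image = np.full((32, 32), 0.2)
image[8:24, 8:24] = 0.8
truth = image > 0.5

scores = sensitivity(image, truth, toy_predict, brightness,
                     [0.5, 1.0, 2.0, 3.0])
for factor, score in scores.items():
    print(f"brightness x{factor}: Dice = {score:.2f}")
```

In this toy setting the score is perfect for the unchanged image, drops to zero when the image is darkened so that the object falls below the threshold, and degrades when the background is brightened above it. For geometric transformations such as rotation, the ground-truth mask would have to be transformed together with the image before scoring.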
It can be used to answer common questions in machine learning projects: Can a network, trained and published by someone else, be applied to my own data? Is it necessary or beneficial to prepare the data in a certain way? We demonstrate how these questions can be addressed by sensitivity analysis in the first case study. Other common questions are: How robust is a model that was trained on a limited dataset with respect to characteristics of the data (e.g. orientation, brightness)? How problematic are potential perturbations such as image artifacts? An approach to answering these questions is described in the second case study.
Besides describing the method and highlighting its utility in two case studies, we present an open-source Python library called misas (model interpretation through sensitivity analysis for segmentation) that makes it easy to apply sensitivity analysis to new data and segmentation models.
The software library described in this article is written in Python 3. It was developed using literate programming (23) in Jupyter notebooks with the nbdev framework, which keeps all library code, documentation, and tests in one place. The source code is hosted on GitHub (https://github.com/chfc-cmi/misas) and archived at Zenodo (https://doi.org/10.5281/zenodo.4106472). The documentation (https://chfc-cmi.github.io/misas) consists of both a description of the application programming interface (API) and tutorials, which include the two case studies. Continuous integration is provided by GitHub Actions: every version pushed to the master branch is tested by running all cells of each notebook in a defined minimal environment. Installable packages are released to the Python Package Index (https://pypi.org/) for easy installation. misas builds on multiple other open-source projects, including fastai (24), PyTorch (25), TorchIO (26), and NumPy (27).
The software is generic and framework-independent and was tested with PyTorch, fastai v1, fastai v2, and TensorFlow (28). To apply misas to new data, images and masks can be imported from a variety of sources, e.g. from png images. The model needs to provide a prediction function that takes an image and returns a predicted segmentation mask (Fig. 1). If the model requires a defined input size, an optional function for size preparation can be provided. misas can easily be extended with custom transformation functions, whose input and output must be instances of the fastai Image/ImageSegment classes but which can perform arbitrary operations on the data in between.
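The interface described above can be sketched in a framework-independent way. The following example is an illustration under stated assumptions, not the misas API: it uses plain numpy arrays instead of the fastai Image/ImageSegment classes, and the names ThresholdModel, prepare_size, fit_to_size, and flip_horizontal are hypothetical.

```python
import numpy as np

def fit_to_size(array, size):
    """Center-crop or zero-pad a 2D array to the requested (height, width)."""
    out = np.zeros(size, dtype=array.dtype)
    h = min(size[0], array.shape[0])
    w = min(size[1], array.shape[1])
    src_y, src_x = (array.shape[0] - h) // 2, (array.shape[1] - w) // 2
    dst_y, dst_x = (size[0] - h) // 2, (size[1] - w) // 2
    out[dst_y:dst_y + h, dst_x:dst_x + w] = \
        array[src_y:src_y + h, src_x:src_x + w]
    return out

class ThresholdModel:
    """Hypothetical wrapper exposing a prediction and a size-preparation hook."""
    input_size = (16, 16)  # assumed fixed input size of the model

    def prepare_size(self, image):
        """Optional hook: bring the image to the size the model expects."""
        return fit_to_size(image, self.input_size)

    def predict(self, image):
        """Take an image and return a segmentation mask (toy threshold)."""
        return (image > 0.5).astype(np.uint8)

def flip_horizontal(image, mask):
    """Custom transformation: mirror the image and its mask consistently."""
    return image[:, ::-1], mask[:, ::-1]

# Synthetic image with one bright rectangle and its true mask.
image = np.zeros((20, 24))
image[5:15, 2:10] = 1.0
mask = (image > 0.5).astype(np.uint8)

model = ThresholdModel()
t_image, t_mask = flip_horizontal(image, mask)
pred = model.predict(model.prepare_size(t_image))
expected = fit_to_size(t_mask, model.input_size)
print(pred.shape, np.array_equal(pred, expected))
```

Applying the same transformation and the same size preparation to the true mask keeps prediction and reference aligned, so that they can be compared pixel by pixel.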