Generation of a semi-automated image annotation framework
A total of 7,848 IHC-stained high-resolution images of human testis, corresponding to 3,046 different antibody stainings and 2,794 unique proteins, were divided into three different sets: a training set (5,411 images), a validation set (1,063 images) and a test set (1,374 images). All images were annotated manually for five germ cell types (spermatogonia, preleptotene spermatocytes, pachytene spermatocytes, round/early spermatids and elongated/late spermatids) and three somatic cell types (Sertoli cells, Leydig cells and peritubular cells), taking into consideration staining intensity (negative, weak, moderate, strong) and subcellular localization of the staining (cytoplasm, nucleus, membrane). The manually scored images formed the basis for a semi-automated image annotation framework, as presented in Figure 1.
Cell type-specific expression based on manual annotation
To determine the relationship between different cell types based on protein expression as determined by manual annotation, a correlation matrix was generated using Pearson's correlation and Ward's hierarchical clustering (Figure 2a). As expected based on functional characteristics 18, there were three main clusters: i) somatic cells (Sertoli cells, Leydig cells and peritubular cells), ii) premeiotic cells (spermatogonia and preleptotene spermatocytes) and iii) meiotic/post-meiotic cells (pachytene spermatocytes, round/early spermatids and elongated/late spermatids). Of the 7,848 images analyzed, only 815 (10%) showed immunoreactivity in one cell type only, while most of the images were positive in two to five cell types (Figure 2b). In 35 images, the human observer had marked all cell types as negative. When separated, the datasets showed slightly different proportions of the number of positive cell types (Figure 2c): the test set comprised more cell type-specific images, while the validation set contained a higher proportion of images in which five to eight cell types had been labeled. There were large differences in the prevalence of different cell type labels (Figure 2d), with Leydig cells being labeled in as many as 5,218 (66%) of the images, while peritubular cells represented the most unusual staining pattern, positive in only 755 (10%) of the images. The staining was mostly localized to the cytoplasm, to both the cytoplasm and the plasma membrane, or to the nucleus, but there were clear differences between cell types. Sertoli cells more often showed positivity in the plasma membrane or in a combination of nucleus and membrane, in most cases corresponding to the nuclear membrane. A majority of the staining observed in Leydig cells was cytoplasmic (Figure 2d).
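The clustering step described above can be sketched as follows. This is a minimal illustration assuming a binary image-by-cell-type annotation matrix (the actual preprocessing of the manual scores is not detailed here); the Pearson correlations between cell types are converted to distances before applying Ward linkage:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Hypothetical binary annotation matrix: rows = images, columns = 8 cell types
rng = np.random.default_rng(0)
annotations = rng.integers(0, 2, size=(100, 8))

# Pearson correlation between cell types (columns)
corr = np.corrcoef(annotations.T)

# Convert correlation to a dissimilarity and apply Ward's hierarchical clustering
dist = 1.0 - corr
np.fill_diagonal(dist, 0.0)
condensed = squareform(dist, checks=False)
Z = linkage(condensed, method="ward")

# Cut the dendrogram into (at most) three main clusters of cell types
clusters = fcluster(Z, t=3, criterion="maxclust")
```

With the real annotation data, the three resulting clusters would correspond to the somatic, premeiotic and meiotic/post-meiotic groups described above.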
Training of neural network and overall model performance
The manually annotated images from the training set of 5,411 images and the validation set of 1,063 images were used for training a Hybrid Bayesian Neural Network (HBNet) model, exploiting DropWeights and combining the features from a standard deep neural network (DNN) with handcrafted features. The output of the neural network is an 8-dimensional probability vector, where each dimension indicates how likely it is that the corresponding cell type in a given image expresses the protein. The neural network was then applied to the test set of 1,374 images, for which the accuracy was evaluated.
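As a concrete illustration of this multi-label output format, the 8-dimensional probability vector can be converted into independent per-cell-type calls. The probability values and the 0.5 decision threshold below are hypothetical; the point is that each dimension is decided independently, unlike a softmax over mutually exclusive classes:

```python
import numpy as np

CELL_TYPES = [
    "spermatogonia", "preleptotene spermatocytes", "pachytene spermatocytes",
    "round/early spermatids", "elongated/late spermatids",
    "Sertoli cells", "Leydig cells", "peritubular cells",
]

# Hypothetical 8-dimensional probability vector for one image
probs = np.array([0.91, 0.12, 0.85, 0.77, 0.40, 0.05, 0.63, 0.02])

# Independent per-class decision: an image may be positive in 0 to 8 cell types
positive = {ct: bool(p >= 0.5) for ct, p in zip(CELL_TYPES, probs)}
```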
Evaluation metrics for multi-label classification performance differ from those used in binary or multi-class classification 21. In multi-label classification, a misclassification is no longer a definite right or wrong, since a prediction containing a subset of the actual labels is considered better than a prediction containing none of them. Here, four different metrics were used for evaluating the multi-label classification performance: i) Hamming loss, ii) F1-score, iii) Exact Match ratio, and iv) mean Average Precision (mAP). Table 1 presents the statistics for each of these metrics, both for the standard DNN and the proposed HBNet. Hamming loss is the most common evaluation metric in multi-label classification; it takes into account both prediction errors (false positives) and missed predictions (false negatives), normalized over the total number of classes and the total number of samples analyzed. The smaller the value of the Hamming loss (closer to 0), the better the performance of the learning algorithm. The F1 score is the harmonic mean of recall and precision, where the Macro F1 score calculates the metric independently for each label and then takes the average, while the Micro F1 score aggregates the contributions of all labels when calculating the average metric. The Exact Match ratio is the strictest metric, indicating the percentage of all analyzed samples that have all their labels classified correctly. Mean Average Precision (mAP) calculates the average precision (AP) separately for each label and then averages over all classes. It provides a measure of quality across recall levels, and has been shown to be stable and able to distinguish between cell types. The higher the mAP (closer to 100), the better the quality. In the present investigation, there was considerable improvement using HBNet across all metrics used (Table 1). Based on HBNet, the Exact Match ratio showed that 67% of the 1,374 images were correctly classified in all eight cell types.
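All four metrics can be computed, for example, with scikit-learn. The label matrix and scores below are hypothetical toy data; note that `accuracy_score` applied to binary indicator matrices is exactly the subset accuracy, i.e. the Exact Match ratio:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, average_precision_score,
                             f1_score, hamming_loss)

# Hypothetical example: 3 images x 8 cell types
y_true = np.array([[1, 0, 1, 1, 0, 0, 1, 0],
                   [0, 1, 0, 0, 1, 0, 0, 1],
                   [1, 1, 1, 0, 0, 1, 0, 0]])
y_score = np.array([[0.9, 0.2, 0.8, 0.7, 0.1, 0.3, 0.6, 0.4],
                    [0.1, 0.8, 0.3, 0.2, 0.9, 0.1, 0.2, 0.3],
                    [0.7, 0.6, 0.9, 0.4, 0.2, 0.8, 0.1, 0.2]])
y_pred = (y_score >= 0.5).astype(int)

metrics = {
    "hamming_loss": hamming_loss(y_true, y_pred),           # lower is better
    "f1_macro": f1_score(y_true, y_pred, average="macro"),  # per-label F1, then mean
    "f1_micro": f1_score(y_true, y_pred, average="micro"),  # pooled over all labels
    "exact_match": accuracy_score(y_true, y_pred),          # all 8 labels correct
    "mAP": average_precision_score(y_true, y_score, average="macro"),
}
```

In this toy example only the second image has a label error (one missed positive), so the Hamming loss is 1/24 and the Exact Match ratio is 2/3.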
Cell type-specific model performance
Next, we evaluated the model’s performance on a cell type-specific level. Figure 3 shows a confusion matrix comparing the output of the neural network with the manual observer, summarizing the false positives and false negatives of the DNN and the HBNet for each cell type. For all cell types, HBNet had a higher accuracy than DNN, with >80% overall accuracy, and >90% for Sertoli cells and peritubular cells. The largest difference between DNN and HBNet was seen for pachytene spermatocytes and round/early spermatids, where the accuracy improved from 75.6 to 82.6% and from 69.3 to 80.5%, respectively. HBNet dramatically reduced the number of false negatives compared to DNN, but also showed a decrease in the number of false positives. The total number of false positives (n=444) across all cell types was lower than the number of false negatives (n=993), indicating that the model performed better at accurately detecting positive labels, but more often differed from the human observer when classifying cell types as negative. This is expected, since the human observer deliberately neglects very weak staining patterns that can be considered unspecific or due to artifacts. The ratios between false positives and false negatives were, however, reversed for Sertoli cells and peritubular cells, for which false negatives were rare. Positivity in these cell types was not only less common in general (Figure 2d), but also to a larger extent cell type-specific, less often showing simultaneous staining in other cell types (Figure 2a). This suggests that positivity in these cell types was mostly considered specific by the human observer.
Estimation of model certainty
To rank all images based on model confidence over the eight cell types, each prediction included an uncertainty measurement, presented as a GTL Score. Supplementary Table 1 shows the predictions per cell type for each of the 1,374 images in the test set, along with the GTL Score and manual annotation. The GTL Scores ranged from zero to one for each HBNet prediction over the eight cell types. All predictions were then plotted in confidence maps (Figure 4), where images for which the model agreed with the human observer, i.e. the cell type was truly positive or truly negative, were marked in green, whilst images with disagreement between the model and the human observer were marked in red. Misclassified images tended to have lower GTL Scores than correctly classified images. The shape of the GTL curves varied between cell types, and the curves for Sertoli cells and peritubular cells stood out as having a higher proportion of images with low GTL Scores than the other cell types. This is because staining in these cell types was less common (Figure 2d), and cell types classified as lacking staining often have low GTL Scores. The spread of misclassifications determined the cutoff for reliable classification, marked as a blue line. Note that this cutoff was set at a GTL Score between 0.0 and 0.11 for all cell types except pachytene spermatocytes, round/early spermatids and elongated/late spermatids, for which it was set at 0.22, 0.78 and 0.22, respectively. The protein expression patterns of these three cell types showed a high correlation (Figure 2a), suggesting that many proteins were co-expressed in these cells. Since the labels were not mutually exclusive, this may explain why the model had more difficulty distinguishing these cell types from each other.
When only considering samples above the GTL cutoff, i.e. classifications of high reliability, the classification accuracy of the HBNet model was substantially improved (Table 2). The GTL-thresholded HBNet accuracy was >92% for all cell types except round/early spermatids, which had an accuracy of 83.5%. For most cell types, approximately 30 to 39% of the images fell below the GTL cutoff, except for peritubular cells, where only 1.3% of the images were discarded, and Sertoli cells, where none were. Predictions above the cutoff can be considered reliably annotated by the model, which means that manual annotation is only needed for, on average, 28.1% of the predictions. Note that the choice of GTL threshold entails a direct tradeoff between accuracy and the number of discarded images (Supplementary Figure 1).
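The accuracy-versus-discarded-images tradeoff can be illustrated with a small sketch. The GTL Scores and correctness flags below are synthetic, constructed so that the low-confidence predictions are the misclassified ones; this is not the actual test-set data:

```python
import numpy as np

def gtl_thresholded_accuracy(correct, gtl_scores, cutoff):
    """Accuracy restricted to predictions whose GTL Score exceeds the cutoff.

    correct:    boolean array, True where the model agrees with the manual annotation
    gtl_scores: per-prediction confidence in [0, 1]
    Returns (accuracy_above_cutoff, fraction_discarded).
    """
    keep = gtl_scores > cutoff
    if not keep.any():
        return float("nan"), 1.0
    return float(correct[keep].mean()), float(1.0 - keep.mean())

# Synthetic data: 100 predictions; those with GTL Score <= 0.3 are misclassified
gtl = np.linspace(0.0, 1.0, 100)
correct = gtl > 0.3

acc_all, _ = gtl_thresholded_accuracy(correct, gtl, cutoff=0.0)
acc_hi, frac_discarded = gtl_thresholded_accuracy(correct, gtl, cutoff=0.22)
```

Raising the cutoff improves the retained accuracy (here from 70/99 to 70/78) at the cost of discarding 22% of the predictions, mirroring the tradeoff shown in Supplementary Figure 1.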
Evaluation of correctly classified and misclassified images
The GTL confidence metric allowed us to identify both correctly classified images and images where the model disagreed with the human observer for one or several cell types. Figure 5 provides examples of correctly classified images, i.e. images among the 67% that, according to the Exact Match ratio, had all eight cell types annotated as either true positive or true negative. The images show that the model performed well both for proteins with distinct and selective staining and for more complex images where the protein was expressed in several cell types with varying intensity and staining patterns. The IHC-stained images are presented along with heatmaps 22 highlighting which areas of the images the model focused on when making the labeling decision. For the correctly classified images, it is evident that the model focused on several different areas within each image, including areas where cells were intact and well-represented.
Misclassified predictions included both false positive and false negative images, and could be further divided into cases with high certainty (high GTL Score) and low certainty (low GTL Score). Several misclassified predictions represented clear errors made by the manual observer (Figure 6a). Such misclassifications often had high GTL Scores, and in these cases, the model can be used to identify manual mistakes. Other misclassified predictions were due to unspecific staining deliberately neglected by the human observer (Figure 6b). Such stainings, in need of further protocol optimization, were often represented by false positive predictions with high GTL Scores, indicating that the model made a correct prediction, but that, based on experience, the positivity was interpreted as unspecific by the human observer. Some misclassified images corresponded to proteins expressed in small structures, including nuclear membranes, nucleoli or centrosomes (Figure 6c). Such staining patterns are rare, and may be particularly challenging for the model to interpret due to limitations in the current pixel resolution. These predictions were often false positives with low GTL Scores. Finally, some misclassified images contained artifacts, such as damaged tissue sections, or sections containing areas where the testicular samples were not completely healthy (Figure 6d). Such misclassifications, both false positives and false negatives, often had low GTL Scores, and it was evident from the model heatmaps that the labeling decisions were mostly based on areas of the images where not all cell types were clearly represented, or where the image or visible cells were of poor quality.
Model performance based on subcellular localization and staining intensity
The manual annotation of cell type-specific protein expression took into consideration not only which cell types were positive, but also in which subcellular organelle the staining was observed. Table 3 presents the GTL-thresholded model performance in the test dataset at the subcellular level. As in the whole dataset (Figure 2d), it was clear that some subcellular patterns were more common in certain testicular cell types, which may affect the overall accuracy; it should also be noted that the different subcellular localizations appear differently in the various cell types depending on cell shape. Overall, the best accuracy was found for staining patterns where all subcellular localizations (cytoplasmic, membranous and nuclear) were present. This is not surprising, as clear outlining of each cell structure increases the likelihood of the model identifying the correct cell types. Sertoli cells had lower accuracy for certain subcellular localizations compared to other cell types. Staining of Sertoli cells is challenging to interpret, as these cells are situated in the interspace between the germ cells, and their staining may be difficult to distinguish from that of other cell types.
In addition to the cell type-specific pattern and subcellular localization of the staining, the human observer also takes the intensity of the staining into consideration. This rather subjective measurement, which reflects the brown saturation level, is considered to represent the amount of protein expression, ranging from low levels (weak staining/beige color), through moderate levels (medium brown), to high levels (dark brown/black). As seen in Table 4, the GTL-thresholded accuracy did not depend on staining intensity, and there was no significant improvement in predictions for distinctly stained cells compared to those showing fainter positivity.