Semi-automated annotation of cell type-specific protein expression patterns in human testis based on immunohistochemistry

Immunohistochemistry (IHC) provides the basis for cell type-specific localization of protein expression patterns in human tissues. Manual annotation of complex IHC images is however expensive and may lead to errors or inter-observer variability. Artificial Intelligence holds much promise for efficient and accurate pattern recognition, but confidence in prediction needs to be addressed. To present a reliable model for annotation of IHC images, we developed a semi-automated framework for multi-label classification of 7,848 human testis samples stained with IHC, and manually annotated in situ protein expression in eight different cell types. The dataset was used as a basis for training and testing a proposed Hybrid Bayesian Neural Network. By combining the deep learning model with a novel uncertainty metric, the average diagnostic performance improved from 86.9% to 96.3%. The streamlined workflow has important implications for accurate large-scale efforts mapping the human cell type-specific proteome in health and disease.

The training set of 5,411 images and the validation set of 1,063 images were used for training a Hybrid Bayesian Neural Network (HBNet) model, exploiting DropWeights and combining the features from a standard deep neural network (DNN) with handcrafted features. The output of the neural network is an 8-dimensional probability vector, where each dimension indicates how likely each cell type in a given image expresses the protein. The neural network was then applied to the test set of 1,374 images, for which the accuracy was evaluated.


Cell type-specific expression based on manual annotation
To determine the relationship between different cell types based on protein expression as determined by manual annotation, a correlation matrix was generated using Pearson's correlation and Ward's hierarchical clustering (Figure 2a). The clustering was as expected based on functional characteristics 18 .

Evaluation metrics for multi-label classification performances are different from those used in binary or multi-class classification 21 . In multi-label classification, a misclassification is no longer a definite right or wrong, since a correct prediction containing a subset of the actual labels is considered better than a prediction containing none of them. Here, four different metrics were used for evaluating the multi-label classification performance: i) Hamming loss, ii) F1-score, iii) Exact Match ratio, and iv) mean-Average Precision (mAP). Table 1 presents the statistics for each of these metrics, both for the standard DNN and the proposed HBNet. Hamming loss is the most common evaluation metric in multi-label classification, which takes into account both prediction errors (false positives) and missed predictions (false negatives), normalized over the total number of classes and the total number of samples analyzed. The smaller the value of Hamming loss (closer to 0), the better the performance of the learning algorithm. The F1 score is the harmonic mean of recall and precision, where the Macro F1 score calculates the metric independently for each label and then takes an average, and the Micro F1 score aggregates the contributions of all labels when calculating the average metric. The Exact Match ratio is the strictest metric, indicating the percentage of all analyzed samples that have all their labels classified correctly. Mean Average Precision (mAP) takes into account both the average precision (AP) separately for each label and the average over the classes. It provides a measure of quality across recall levels, and has been shown to be stable and able to distinguish between cell types.
The higher the mAP (closer to 100), the better the quality.
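The four metrics above can be computed directly from binary label matrices. The sketch below is an illustrative NumPy-only implementation (the function name and the toy arrays are our own, not from the paper's code); in practice, libraries such as scikit-learn provide equivalent routines.

```python
import numpy as np

def multilabel_metrics(y_true, y_pred, y_score):
    """Illustrative multi-label metrics on (n_samples, n_labels) binary matrices."""
    n, c = y_true.shape
    # Hamming loss: fraction of wrong labels over all samples and classes
    hamming = np.mean(y_true != y_pred)
    # Exact Match ratio: fraction of samples with all labels correct
    exact_match = np.mean(np.all(y_true == y_pred, axis=1))
    # Micro F1: aggregate TP/FP/FN over all labels, then one F1
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    micro_f1 = 2 * tp / (2 * tp + fp + fn)
    # Macro F1: F1 per label, then averaged
    f1s = []
    for j in range(c):
        tp_j = np.sum((y_pred[:, j] == 1) & (y_true[:, j] == 1))
        fp_j = np.sum((y_pred[:, j] == 1) & (y_true[:, j] == 0))
        fn_j = np.sum((y_pred[:, j] == 0) & (y_true[:, j] == 1))
        denom = 2 * tp_j + fp_j + fn_j
        f1s.append(2 * tp_j / denom if denom else 0.0)
    macro_f1 = float(np.mean(f1s))
    # mAP: average precision per label (area under precision-recall), then mean
    aps = []
    for j in range(c):
        order = np.argsort(-y_score[:, j])          # rank by predicted score
        rel = y_true[order, j]
        if rel.sum() == 0:
            continue
        prec = np.cumsum(rel) / (np.arange(n) + 1)  # precision at each rank
        aps.append(np.sum(prec * rel) / rel.sum())
    return {"hamming": float(hamming), "exact_match": float(exact_match),
            "micro_f1": float(micro_f1), "macro_f1": macro_f1,
            "mAP": float(np.mean(aps))}
```

Note the asymmetry the text describes: Hamming loss penalizes each wrong label individually, whereas the Exact Match ratio gives credit only when all eight cell-type labels of an image are correct.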
In the present investigation, there was considerable improvement using HBNet across all metrics used (Table 1). Based on HBNet, the Exact Match ratio showed that 67% of the 1,374 images were correctly classified in all eight cell types.
Cell type-specific model performance

Next, we evaluated the model's performance on a cell type-specific level. In Figure 3, a confusion matrix is shown, comparing the output of the neural network with the manual observer, summarising the false positives and negatives of the DNN and the HBNet for each cell type. For all cell types, HBNet had a higher accuracy than DNN, with >80% overall accuracy, and >90% for Sertoli cells and peritubular cells. The largest difference between DNN and HBNet was seen for pachytene spermatocytes and round/early spermatids, where the accuracy improved from 75.6 to 82.6%, and from 69.3 to 80.5%, respectively. HBNet dramatically reduced the number of false negatives compared to DNN, but also showed a decrease in the number of false positives. The total number of false positives (n=444) across all cell types was lower compared to the number of false negatives (n=993), indicating that the model performed better at accurately detecting positive labels, but more often differed from the human observer in classifying cell types as negative. This is expected, as the human observer deliberately neglects very weak staining patterns that can be considered unspecific or due to artifacts. The ratios between false positives and false negatives were however opposite for Sertoli cells and peritubular cells, for which false negatives were rare. Positivity in these cell types was not only less common in general (Figure 2d), but also to a larger extent cell type-specific, not as often showing simultaneous staining in other cell types (Figure 2a). This suggests that positivity in these cell types was mostly considered as specific by the human observer.

Estimation of model certainty
To rank all images based on model confidence over eight cell types, each prediction included an uncertainty measurement, presented as a GTL Score. Supplementary Table 1 shows the predictions per cell type for each of the 1,374 images in the test set, along with GTL Score and manual annotation. The GTL Scores ranged from zero to one for each HBNet prediction over the eight cell types. All predictions were then plotted in confidence maps (Figure 4), where images for which the model agreed with the human observer, i.e. the cell type was truly positive or truly negative, were marked in green, whilst images with disagreement between the model and the human observer were marked in red. Images suggested to be misclassified tended to have lower GTL Scores than correctly classified images. The shape of the GTL curves varies for each cell type, and the curves for Sertoli cells and peritubular cells stood out as having a higher proportion of images with low GTL Scores than the other cell types. This is because staining in these cell types was less common (Figure 2d), and cell types classified as lacking staining often have low GTL Scores. The spread of misclassifications determined the cutoff for reliable classification, which was marked as a blue line. Note that this cutoff was set at a GTL Score between 0.0 and 0.11 for all cell types except pachytene spermatocytes, round/early spermatids and elongated/late spermatids, for which it was set at 0.22, 0.78 and 0.22, respectively. The protein expression patterns of these three cell types showed a high correlation (Figure 2a), suggesting that many proteins were co-expressed in these cells. Since they were not mutually exclusive, this may explain why the model had more difficulties distinguishing these cell types from each other.
When only considering thresholded samples above the GTL cutoff, including classifications of high reliability, the classification accuracy of the HBNet model was substantially improved (Table 2). The HBNet GTL-thresholded accuracy was >92% for all cell types except round/early spermatids, which had an accuracy of 83.5%. For most cell types, approximately 30 to 39% of the images were below the GTL cutoff, except for peritubular cells, where only 1.3% of the images were discarded, and Sertoli cells, where none were.
Predictions above the cutoff can be considered reliably annotated by the model, which means that manual annotation is only needed for on average 28.1% of the predictions. Note that there is a direct tradeoff in the choice of GTL threshold between accuracy and the number of discarded images (Supplementary Figure 1).
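The accuracy-versus-coverage tradeoff can be traced by sweeping candidate cutoffs over the scored test set. The helper below is a hypothetical sketch (names and toy data are ours, not the paper's code), assuming one GTL score and one correct/incorrect flag per prediction.

```python
import numpy as np

def coverage_accuracy(gtl_scores, correct, cutoffs):
    """For each candidate GTL cutoff, report (cutoff, fraction discarded,
    accuracy on the retained predictions)."""
    rows = []
    for cut in cutoffs:
        keep = gtl_scores >= cut                       # retained predictions
        acc = correct[keep].mean() if keep.any() else float("nan")
        rows.append((cut, 1.0 - keep.mean(), acc))
    return rows
```

Scanning `cutoffs` over [0, 1] reproduces the kind of tradeoff curve shown in Supplementary Figure 1: raising the cutoff discards more images but raises accuracy on the remainder.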

Evaluation of correctly classified and misclassified images
The GTL confidence metric allowed us to identify both correctly classified images, as well as images where the model disagreed with the human observer for one or several cell types. In Figure 5, examples of correctly classified images are provided, i.e. these images were among the 67% that according to the Exact Match ratio had all eight cell types annotated as either true positive or true negative.
The images show that the model performed well both for proteins with distinct and selective staining and for more complex images where the protein was expressed in several cell types with varying intensity and staining patterns. The IHC stained images are presented along with heatmaps 22 highlighting which areas of the images the model focused on when making the labeling decision. For the correctly classified images, it is evident that the model focused on several different areas within the image, including areas where cells were intact and well-represented.
Misclassified predictions included both falsely positive and falsely negative images, and could be further divided into cases with high certainty (high GTL Score) and low certainty (low GTL Score). Several misclassified predictions represented clear errors made by the manual observer (Figure 6a). Such misclassifications often had high GTL Scores, and in these cases, the model can be used for identifying manual mistakes. Other misclassified predictions were due to unspecific staining deliberately neglected by the human observer (Figure 6b). Such stainings, in need of further protocol optimization, were often represented by false negative predictions with high GTL Scores, indicating that the model performed a correct prediction, but that, based on experience, the positivity was interpreted as unspecific by the human observer. Some misclassified images corresponded to proteins expressed in small structures including nuclear membranes, nucleoli or centrosomes (Figure 6c). Such staining patterns are rare, and may be particularly challenging for the model to interpret due to limitations in the current pixel resolution. These predictions were often false positives with low GTL Scores. Finally, some misclassified images contained artifacts, such as damaged tissue sections, or sections that contained areas where the testicular samples were not completely healthy (Figure 6d). Such misclassifications, both false positives and false negatives, often had low GTL Scores, and it was evident from the model heatmaps that the labeling decisions were mostly made on areas of the images where not all cell types were clearly represented, or where the image or visible cells had poor quality.

Model performance based on subcellular localization and staining intensity
The manual annotation of the cell type-specific protein expression not only took into consideration which cell types were positive, but also in which subcellular organelle the staining was observed. In Table 3, the GTL-thresholded model performance in the test dataset is presented on a subcellular level. Similarly as in the whole dataset (Figure 2d), it was clear that some organelles were more common in certain testicular cell types, which may affect the overall accuracy, but it should also be noted that the patterns of different subcellular localizations appear differently in the various cell types depending on cell shape. In total, the best accuracy was found for staining patterns where all subcellular localizations (cytoplasmic, membranous and nuclear) were present. This is not surprising, as clear outlining of each cell structure increases the likelihood of the model identifying the correct cell types. Sertoli cells had lower accuracy for certain subcellular localizations compared to other cell types. Staining of Sertoli cells is challenging to interpret as these cells are situated in the interspace between the germ cells, and staining may be difficult to distinguish from that of other cell types.
In addition to the cell type-specific pattern and subcellular localization of the staining, the human observer also takes into consideration the intensity of the staining. This rather subjective measurement of the brown saturation level is considered to represent the amount of protein expression, ranging from low levels (weak staining/beige color), through moderate levels (medium brown), to high levels (dark brown/black). As seen in Table 4, it is evident that the GTL-thresholded accuracy did not depend on staining intensity, and there was no significant improvement in predictions performed on distinctly stained cells compared to those that showed more faint positivity.
Discussion

IHC constitutes the standard approach for spatial localization of proteins at a cell type-specific level. The technology originates from the early 1940s 23 and has emerged as a quick, simple and cost-effective method applicable to diagnostic routine as well as basic and clinical research. The output of the IHC staining is typically a tissue section manually evaluated under a microscope, but with advances in digital pathology, large-scale digitization of stained sections is becoming more common. This allows for the development of algorithms to automatically predict the IHC staining. There is a widely acknowledged demand for implementation of such algorithms both in healthcare and research for more accurate and faster annotation. To date, there are however no previous studies suggesting how such frameworks can be implemented for high-throughput annotation of complex tissue samples stained with IHC. Despite impressive reported accuracy, deep learning models tend to require huge training sample image sets. Furthermore, they tend to make overconfident predictions and lack the ability to report "I don't know" for ambiguous or unknown cases. It is therefore not sufficient to depend on prediction scores alone from deep learning models; it is critical to estimate bias-reduced uncertainty as an additional insight into the prediction.
Combining IHC with the TMA technology, where a large number of tissue samples are assembled into one array, allows for large-scale mapping efforts studying the entire human proteome. Built upon this strategy, the HPA project has characterized >15,000 different proteins across >40 different normal tissues and organs, and 20 types of cancer 1-2 . The publicly available database www.proteinatlas.org contains >10 million high-resolution images that have been manually annotated, thereby constituting a major resource for machine learning algorithms. In the present investigation, we focused on images of normal testis, due to the complex architecture of this organ, built up by several different cell types that are challenging to interpret, and the unique nature of this tissue, harbouring a large number of specific proteins of unknown function that are interesting to characterize further [16][17] .
We here successfully applied deep learning-based predictions of cell type-specific protein expression patterns in histological sections stained with IHC. The predictions were combined with hybrid image features, DropWeights methods and an approximate BNN to compute a bias-reduced uncertainty score as a vital additional measure, generating an uncertainty measurement defined as the GTL Score.
The proposed HBNet architecture showed outstanding performance both in simple images with clear cell type-specific staining, and in more complex images where several cell types showed positivity of varying intensity and staining patterns. The novel GTL Score adds another level of insight, particularly important for challenging cases where uncertain predictions can be highlighted. This unique workflow of image annotation allows for dividing the dataset into images that are reliably classified by the model, and images that need to be annotated by the manual observer, thereby introducing a semi-automated high-accuracy framework that reduces the manual burden.
Manual annotation of testis samples is a tedious task, as the germ cell lineage constitutes a continuum, where the stem cells undergo several steps of mitosis and meiosis before developing into mature sperm. This complex process involves thousands of genes and proteins activated and repressed at certain time points, which means that different proteins are expressed in certain combinations of these cell types. Some proteins may be expressed in just one subset, while others are more ubiquitously expressed. Many proteins are increased or decreased during differentiation, seen as a gradient in expression. It should also be noted that although the germ cells can be divided into distinct stages and cell types, some proteins may be expressed just between these stages during a short period in time, which means that such proteins may be found only in the seminiferous ducts that are in the correct stage, i.e. variations may occur within the same image. Furthermore, depending on the stage and how the section was taken, not all ducts within an image may contain the end product of spermatogenesis: the mature sperm. A manual observer needs to take into consideration all cells for each of the eight cell types present within an image, and make an average decision that corresponds to the overall expression pattern.
The manual annotation is not only based on visual examination of staining intensity, but to a large extent also relies on experience, where the manual observer takes into consideration the staining protocol, overall image quality, artifacts and previous literature on the protein being analyzed. An optimized staining protocol shows only specific antibody binding, i.e. brown color, in structures expressing the protein. It is however extremely challenging to retrieve the right balance of specific staining vs. unspecific antibody binding 24 , and although the HPA spends considerable efforts on antibody validation 25 , many IHC images contain weak staining that may be considered unspecific. Such off-target binding means that the antibody, when present in high enough concentration, binds to structures that do not express the protein. This staining is of lower intensity than the true protein expression, and can be neglected by the human observer, especially when accompanied by distinct staining in other structures that more likely represents the true protein expression. Such experience-driven decisions take into consideration the overall positivity and general staining pattern in the whole image. If certain cells show strong nuclear staining, it is likely that faint cytoplasmic staining seen in other cells is unspecific and regarded as background, even if it is of an intensity that in other images would be considered above threshold. Furthermore, the manual observer can consult available literature on proteins that are challenging to interpret, which may guide in which cell types or subcellular localizations the protein should be present. Finally, the manual observer is better at detecting artifacts. The edge of a tissue core may more often attract unspecific binding, and staining only observed near this border would therefore often be neglected.
Although all tissue samples had undergone quality control before processing, it is possible that a fraction of the samples, or a certain part within an image, contains cells that are not healthy. If representative cells are still available, the human observer would only consider these in the annotation process and neglect areas with artifacts. Images may also contain mechanical artifacts related to the tissue processing, such as folds, scratches or damaged structures, or parts of the images may be out of focus after digitization.
Despite the challenges related to tissue processing, IHC staining and manual annotation, our proposed HBNet showed high accuracy for all eight cell types, especially after applying a GTL Score threshold. When examining images above and below this threshold, it was evident that many images for which the model faced challenges were images expected to be particularly difficult, often due to the reasons described above. Three cell types needed a higher GTL Score threshold for reliable prediction: pachytene spermatocytes, round/early spermatids and elongated/late spermatids. This is not surprising, as these cells correspond to the most common combination for proteins co-expressed in more than one testicular cell type, as described previously 18 . This means that proteins are more commonly expressed in combinations of two to three of these cell types than solely expressed in just one of them. As a result, a high proportion of the images used for model training corresponded to proteins co-expressed in these cells, thereby leading to challenges in distinguishing them.
Previous multi-label classification studies include a recent Kaggle challenge 26 . Previous studies conducted on histological images have mainly focused on tissue type classification for disease detection 12 . Here, we present the first approach in multi-label classification based on antibody-based proteomics, to recognize cell type-specific protein expression patterns in eight different testicular cell types.
In addition to highlighting which images need to be examined by the manual observer, thus allowing for large-scale IHC efforts, our proposed workflow incorporates a confidence metric that has important implications for identifying images with manual annotation errors, thereby improving the overall accuracy. This is applicable to both research and clinical routine, and may replace the otherwise common manual annotation workflow in which one observer first annotates each image, followed by quality control by a second observer. It may also be used for teaching purposes in the training of manual observers with less experience, which saves both time and money as less quality control is needed from experienced personnel.
In the present investigation, healthy samples from one particular tissue, the testis, were used. Based on the encouraging performance of our proposed model on what constitutes a particularly challenging tissue, we believe that the approach is applicable to other tissues as well. Similar workflows can be used in projects focusing on distinguishing between healthy and diseased tissues, widely applicable to e.g. cancer research but also routine diagnostics. The daily pathology workflow largely depends on manual microscopic evaluation of tissue sections, which may not only lead to a delayed disease diagnosis with potentially worsened patient prognosis, but also to a false diagnosis 31 . Further advances in automated annotation of histological sections are therefore clearly warranted.
In the explosive era of "big data", the emerging field of single cell RNA-seq (scRNA-seq) has received increased attention during the last few years. This novel technology allows for quantitative measurements of single cell transcriptomes across different human tissues and cell types 32 . Further advances in this field will lead to the possibility of identifying sets of genes and proteins elevated in certain cell types, e.g. a gene may not only be defined as elevated in testis in comparison to other organs, but robust data may suggest that the gene is elevated in e.g. spermatogonia. In order to translate such findings on the transcriptomic level to functionally relevant information, it is necessary to complement the data with studies on the protein level, as proteins constitute the functional representation of the genome. With IHC constituting the main approach for cell type-specific localization of proteins in human tissues, this will likely result in an increased interest in in situ-based techniques such as IHC and other antibody-based technologies, both for studying one protein at a time, and for multiplex efforts where several different proteins are labeled simultaneously in one tissue section. For the implementation of such large-scale efforts, machine learning approaches that can save both time and money and lead to more accurate image annotations are highly desirable.
Here, we present a comprehensive strategy for semi-automated annotation of IHC sections combined with an uncertainty metric. The suggested streamlined workflow constitutes an important approach for accurate large-scale efforts mapping the human proteome at a cell type-specific level, and holds promise for both research and diagnostics aiming at analyzing the spatio-temporal expression of human proteins in health and disease.

Tissues and protein profiling
Human tissue samples for IHC analysis were collected and handled in accordance with Swedish laws and regulations. Tissues were obtained from the Clinical Pathology department, Uppsala University Hospital, Sweden, and collected within the Uppsala Biobank organization. All samples were anonymized for personal identity by following the approval and advisory report from the Uppsala Ethical Review Board (Ref # 2002-577, 2005-388, 2007-159). Informed consent was obtained from all subjects in the study. Generation of tissue microarrays (TMAs), IHC staining and digitization of stained TMA slides was performed essentially as previously described 33 . In brief, formalin-fixed, paraffin-embedded (FFPE) tissue blocks were assembled into TMAs based on 1 mm cores from 44 different normal tissue types corresponding to three individuals per tissue, including normal testis samples from adult individuals. TMA blocks were cut in 4 µm sections, dried overnight at room temperature (RT), and baked at 50°C for at least 12 h. Automated IHC was performed using a Lab Vision Autostainer 480S Module (ThermoFisher Scientific, Fremont, CA), as described in detail previously. The stained slides were digitized with a ScanScope AT2 (Leica Aperio, Vista, CA) using a 20x objective. All digital images corresponding to antibody data that passed HPA quality criteria were made publicly available on www.proteinatlas.org.

Dataset and manual annotation
High-resolution digital images of IHC stained testis TMA cores corresponding to 512 testis elevated proteins 18 , publicly available in HPA version 18 (v18.proteinatlas.org), were downloaded along with images from 2,282 proteins published in the current version 19 (v19.proteinatlas.org) that previously had been manually annotated as showing IHC staining of moderate intensity in at least a subset of cells in testis. All proteins were analyzed with at least one antibody that was approved according to Human Protein Atlas criteria for antibody validation. For most of the proteins, three different images were available, and the total dataset comprised 7,848 images corresponding to 2,794 unique human proteins. Each antibody staining was manually re-annotated in eight different testicular cell types, including five germ cell types (spermatogonia, preleptotene spermatocytes, pachytene spermatocytes, round/early spermatids and elongated/late spermatids), and three somatic cell types (Sertoli cells, Leydig cells and peritubular cells). The annotation considered staining intensity (negative, weak, moderate, strong) and subcellular localization (cytoplasmic, nuclear, membranous, or a combination of those). The entire dataset was divided into three sets: a training set of 5,411 images (manually annotated by one observer), a validation set of 1,063 images (manually annotated by one observer and quality controlled by two independent observers, as previously described 18 ), and a test set of 1,374 images (manually annotated by one observer and quality controlled by one independent observer).

The Hybrid Bayesian Neural Network (HBNet)
The models were trained and evaluated using Keras with a TensorFlow backend. For hybrid feature extraction, we used a combination of hand-crafted feature extraction and a convolutional neural network (CNN) approach [34][35] . The original JPEG images of 3000x3000 pixels were resized to 1024x1024 pixels using bicubic interpolation over a 4x4 pixel neighborhood.
We used a generic building block containing the VGG16 36-37 network to extract deep image features and generate heatmaps. The output of the final pooling layer was used as the CNN feature. We kept the main parts of the VGG16 architecture, introduced fully connected layers on top of the VGG16 convolutional base, and concatenated the hand-crafted features with the CNN feature as input to the fully connected layers.
Hand-crafted features were extracted separately, complementary to the CNN feature. The hand-crafted approaches used were Histogram of Oriented Gradients (HOG) 36 , Haralick 38 and Hu Moments 39 . HOG was applied to all images equally, with eight orientation bins, 8x8 pixels forming a single cell, and those cells organized in an 8x8 formation to form a block. This feature vector containing the image descriptions is the input to the feature selection and classification algorithm. A hybrid feature vector increases the dimensionality of image features. We therefore extracted a 3,732-component feature vector using the Hu, Haralick and HOG methods and a 256-component feature vector using the CNN method. Thus, we used the subspace method to reduce the dimensionality of the hybrid feature vector using PCA, to classify and estimate uncertainty in classification.
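The concatenate-then-reduce step can be sketched as follows. This is an illustrative NumPy-only sketch, not the paper's code: `hu_moments` implements only the first two Hu invariants as a stand-in for the full hand-crafted pipeline (HOG, Haralick, Hu), and random arrays stand in for the real 3,732- and 256-component feature vectors.

```python
import numpy as np

def hu_moments(img):
    """First two Hu invariant moments of a grayscale image (pure-NumPy sketch)."""
    ys, xs = np.mgrid[:img.shape[0], :img.shape[1]]
    m00 = img.sum()
    cx, cy = (xs * img).sum() / m00, (ys * img).sum() / m00
    def mu(p, q):   # central moments (translation invariant)
        return (((xs - cx) ** p) * ((ys - cy) ** q) * img).sum()
    def eta(p, q):  # scale-normalized central moments
        return mu(p, q) / m00 ** (1 + (p + q) / 2)
    h1 = eta(2, 0) + eta(0, 2)
    h2 = (eta(2, 0) - eta(0, 2)) ** 2 + 4 * eta(1, 1) ** 2
    return np.array([h1, h2])

def pca_reduce(X, k):
    """Project feature vectors onto the top-k principal components via SVD."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

# Concatenate hand-crafted and CNN features, then reduce the hybrid vector
hand = np.random.rand(10, 3732)   # stand-in for HOG + Haralick + Hu features
cnn = np.random.rand(10, 256)     # stand-in for the VGG16 pooling output
hybrid = np.concatenate([hand, cnn], axis=1)
reduced = pca_reduce(hybrid, 8)
```

The PCA step addresses the dimensionality problem the text mentions: the concatenated hybrid vector (3,988 components here) is projected onto a low-dimensional subspace before classification and uncertainty estimation.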
DropWeights followed by a sigmoid-activated layer were then applied to the network in the fully connected layer as an approximation to the Gaussian Process (GP), to cast it as approximate Bayesian inference for meaningful estimation of model uncertainty.
In HBNet, overfitting was reduced by using DropWeights with a rate of 0.3, which means that during both training and inference, approximately one-third of all weights were turned off and set to 0. To train the model in our study, we used the Adam optimizer with the default learning rate of 0.001. The training process was conducted over 250 epochs, with a mini-batch size of 32. We monitored the validation accuracy after every epoch and saved the model with the best accuracy on the validation dataset. During test time, DropWeights were active and Monte Carlo (MC) sampling was performed by feeding the input image through the HBNet with 1000 MC samples. This, in turn, allowed us to apply variational DropWeights during testing 40 . For every tested image, the model provided not only its predicted class but also a measure of uncertainty estimated using variational DropWeights (see GTL confidence below).
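The test-time MC sampling with active DropWeights can be sketched with a single stochastic layer. This is a minimal illustration (all names and shapes are ours), reducing the network to one weight matrix followed by a sigmoid; the real HBNet applies the same idea within its fully connected layers.

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_forward(x, W, drop_rate=0.3):
    """One forward pass with DropWeights: ~30% of individual weights are
    zeroed at random, then a sigmoid yields per-label probabilities."""
    mask = rng.random(W.shape) >= drop_rate
    logits = x @ (W * mask)
    return 1.0 / (1.0 + np.exp(-logits))

def mc_predict(x, W, n_samples=1000):
    """Monte Carlo estimate: repeat the stochastic pass; the mean over passes
    is the predictive mean, the spread carries the uncertainty signal."""
    samples = np.stack([stochastic_forward(x, W) for _ in range(n_samples)])
    return samples.mean(axis=0), samples

x = rng.random(64)            # hypothetical reduced hybrid feature vector
W = rng.normal(size=(64, 8))  # weights mapping features to 8 cell-type outputs
mean_probs, samples = mc_predict(x, W)
```

Note that, unlike standard dropout (which zeroes whole activations), DropWeights zeroes individual weight entries, and the mask is resampled on every pass, including at test time.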
The cell type labels in multi-label datasets may be correlated, and a prediction for one cell type is not mutually exclusive with the others. Therefore, we utilized label correlation information during classification. As the cost function for multi-label classification, we selected the sigmoid function combined with binary cross-entropy. A grid search scheme based on the Matthews Correlation Coefficient (MCC) was adopted to determine the optimal threshold for each dimension of the model outcome, which improves the accuracy of the model.
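The per-label MCC grid search can be sketched as follows; function names and the grid values are illustrative assumptions, since the paper does not specify its grid.

```python
import numpy as np

def mcc(y_true, y_pred):
    """Matthews Correlation Coefficient for one binary label."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def best_thresholds(y_true, y_prob, grid=np.linspace(0.05, 0.95, 19)):
    """Per-label grid search: pick the sigmoid cutoff maximizing MCC."""
    return np.array([
        max(grid, key=lambda t: mcc(y_true[:, j], (y_prob[:, j] >= t).astype(int)))
        for j in range(y_true.shape[1])
    ])
```

Searching one threshold per output dimension (rather than a global 0.5 cutoff) accounts for the fact that some cell types, such as Sertoli and peritubular cells, are positive far less often than the germ cell types.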

Approximate Bayesian Neural Networks (BNN)
Bayesian Neural Networks (BNN) provide a natural framework for modeling uncertainty. BNN methods are however intractable in computing the posterior of a network's parameters. The most common approach to estimate uncertainty in deep learning places distributions over each of the network's weight parameters. There are many methods proposed for quantifying uncertainty or confidence estimates approximated by MC dropout, including Laplace approximation, Markov chain MC (MCMC) methods, stochastic gradient MCMC variants such as Langevin Dynamics, Hamiltonian methods including Multiplicative Normalising Flows, Stochastic Batch Normalization, Maximum Softmax Probability, Heteroscedastic Classifier, and Learned Confidence Estimates. Given a dataset $X = \{x_1, \ldots, x_N\}$ with corresponding labels $Y = \{y_1, \ldots, y_N\}$, where $x \in \mathbb{R}^d$ is a $d$-dimensional input vector and $y \in \{1, \ldots, C\}$ with $C$ class labels, i.e. a set of independent and identically distributed (i.i.d.) training samples of size $N$, the task is to find a function $y = f^{W}(x)$, parameterized by the neural network weights $W$, as close as possible to the original function that generated the outputs $Y$. The principled predictive distribution of an unknown label $y^*$ for a test input $x^*$ is obtained by marginalizing over the parameters: $p(y^* \mid x^*, X, Y) = \int p(y^* \mid x^*, W)\, p(W \mid X, Y)\, dW$. The expectation of $y^*$ is called the predictive mean of the model, and its variance is called the predictive uncertainty.
Unfortunately, finding the posterior distribution $p(\omega \mid X, Y)$ is often computationally intractable. Recently, Gal 41 proved that a gradient-based optimization procedure on a dropout neural network is equivalent to a specific variational approximation of a Bayesian neural network. The predictive distribution is then approximated by MC sampling:
$$p(y^* \mid x^*, X, Y) \approx \frac{1}{T} \sum_{t=1}^{T} p(y^* \mid x^*, \hat{\omega}_t)$$
where $\hat{\omega}_t \sim q(\omega)$ are $T$ sets of weights sampled from the variational distribution (here, through DropWeights). For each test sample $x^*$, the class with the largest predictive mean is selected as the prediction.
Ghoshal-Tucker-Lindskog (GTL) Confidence Score
Depending on the input sample, a network can be certain of its decision with high or low confidence, as indicated by the predictive posterior distribution. Traditionally, it has been difficult to implement model validation under epistemic uncertainty; we therefore hypothesized that epistemic uncertainty could inform model uncertainty. One measure of model uncertainty is the predictive entropy of the predictive distribution:
$$H(y \mid x, X, Y) = -\sum_{c} p(y = c \mid x, X, Y) \log p(y = c \mid x, X, Y)$$
where $c$ ranges over all class labels. In general, the range of the obtained uncertainty values depends on e.g. the dataset, the network architecture and the number of MC samples. We therefore normalized the estimated uncertainty when reporting our results, to facilitate comparison across various sets and configurations. Estimating entropy from a finite set of data suffers from a severe downward bias when the data are under-sampled, and even small biases can result in significant inaccuracies. We leveraged the plug-in estimate of entropy together with the Jackknife resampling method to calculate a bias-reduced entropy 40,43-45 . The uncertainty measure was based on maximizing the mutual information between the model posterior density function and the prediction density function, approximated as the difference between the entropy of the predictive distribution and the mean entropy of predictions across MC samples:
$$I(y, \omega \mid x, X, Y) = H(y \mid x, X, Y) - \mathbb{E}_{p(\omega \mid X, Y)}\left[H(y \mid x, \omega)\right]$$
Test points that maximize this mutual information are points on which the model is uncertain on average, yet for which some model parameters produce erroneous predictions with high confidence. This is equivalent to points with high variance in the input to the sigmoid layer (the logits), such that each stochastic forward pass through the model assigns the highest probability to a different class.
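A sketch of these three quantities, computed from a matrix of MC samples (the exact normalization used in the paper is not reproduced here; the Jackknife form below is the standard bias-corrected estimator, and the entropies are per-label binary entropies matching the sigmoid outputs):

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Plug-in binary entropy per label (p = probability of expression)."""
    p = np.clip(p, eps, 1.0 - eps)
    return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))

def jackknife_entropy(mc_probs):
    """Bias-reduced entropy of the predictive mean: T * H_full minus
    (T - 1) times the mean of the T leave-one-out plug-in entropies."""
    T = mc_probs.shape[0]
    h_full = entropy(mc_probs.mean(axis=0))
    total = mc_probs.sum(axis=0)
    loo = np.stack([entropy((total - mc_probs[t]) / (T - 1)) for t in range(T)])
    return T * h_full - (T - 1) * loo.mean(axis=0)

def mutual_information(mc_probs):
    """Entropy of the mean prediction minus mean entropy of predictions."""
    return entropy(mc_probs.mean(axis=0)) - entropy(mc_probs).mean(axis=0)

rng = np.random.default_rng(1)
mc_probs = rng.uniform(0.0, 1.0, size=(1000, 8))   # stand-in MC samples
h_jk = jackknife_entropy(mc_probs)
mi = mutual_information(mc_probs)
```

By Jensen's inequality the mutual information is nonnegative, and it collapses to zero for a deterministic model whose MC samples are all identical, which is a useful sanity check when implementing this.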
Each prediction from our trained model returned a set of labels, and we calculated a GTL Score for each label. As a representativeness heuristic, we employed the maximum class predictive probability distance (CPPD): the difference between the highest and the second-highest predictive probability values. The vector of class probabilities obtained from the $t$-th stochastic forward pass is denoted $p_t = f^{\hat{\omega}_t}(x)$, where $\hat{\omega}_t$ denotes the sampled parameters resulting from DropWeights, and the Monte Carlo DropWeights (MCDW) estimate of the vector of class probabilities is given by $\bar{p} = \frac{1}{T} \sum_{t=1}^{T} p_t$. The CPPD is then the gap between the two largest entries of $\bar{p}$. The MCDW estimate aimed to decompose the sources of uncertainty, and the main idea was to select samples that are not only highly uncertain but also highly representative. Based on this strategy, we defined the GTL Score, combining the CPPD with the bias-corrected entropy obtained using the Jackknife method, as an approximation for semi-automated sample selection.
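The CPPD can be sketched as below. Note that the closed-form combination used for the GTL Score is defined in the paper's equation and is not reproduced here; `gtl_score` below uses an assumed illustrative combination (high CPPD and low bias-corrected entropy give a high score) purely to show the direction of the ranking.

```python
import numpy as np

def cppd(mean_probs):
    """Class Predictive Probability Distance: difference between the
    highest and second-highest mean predictive probability."""
    top2 = np.sort(mean_probs)[-2:]
    return top2[1] - top2[0]

def gtl_score(mean_probs, h_jk):
    """Illustrative GTL-style score (assumed form, not the paper's exact
    equation): rewards a dominant prediction with low entropy."""
    return cppd(mean_probs) / (1.0 + h_jk)

# A confident prediction: one dominant probability, low entropy
confident = gtl_score(np.array([0.95, 0.10, 0.05, 0.02]), h_jk=0.1)
# An ambiguous prediction: two competing probabilities, high entropy
ambiguous = gtl_score(np.array([0.55, 0.50, 0.05, 0.02]), h_jk=0.6)
```

Under any reasonable combination of this kind, the confident prediction scores higher than the ambiguous one, so sorting by the score separates reliable predictions from those needing review.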
We ranked all unlabelled samples in ascending order of GTL Score; the higher the GTL Score of a sample image, the higher the confidence in its prediction. The GTL Score was used together with the predictive probabilities to identify and set aside images for which specific cell types did not express a particular protein, as well as images that expressed the protein with high confidence.

Multi-Label Cross Validation
Multi-label Stratified Shuffle Split cross-validation, a merge of Multi-label Stratified KFold and Shuffle Split 46 , was used to generate stratified, randomized folds for multi-label data. The folds were made by preserving the percentage of samples for each label, and the 10-fold cross-validation was repeated ten times with different randomization in each repetition.
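The stratification idea can be sketched with a simplified greedy assignment (the dataset, fold count and penalty weight below are illustrative; production code would typically use the `iterative-stratification` package referenced above rather than this sketch):

```python
import numpy as np

def greedy_multilabel_folds(Y, n_folds=10, seed=0):
    """Simplified sketch of iterative stratification: assign each sample
    to the fold with the largest remaining deficit of its labels, with a
    small penalty on fold size to keep the folds balanced."""
    rng = np.random.default_rng(seed)
    n, n_labels = Y.shape
    target = Y.sum(axis=0) / n_folds           # desired positives per fold
    fold_pos = np.zeros((n_folds, n_labels))   # positives assigned so far
    fold_size = np.zeros(n_folds)
    folds = np.empty(n, dtype=int)
    for i in rng.permutation(n):
        deficit = (target - fold_pos) @ Y[i] - 0.01 * fold_size
        f = int(np.argmax(deficit))
        folds[i] = f
        fold_pos[f] += Y[i]
        fold_size[f] += 1
    return folds

# Toy multi-label matrix: 200 samples, 8 binary cell-type labels
Y = (np.random.default_rng(2).random((200, 8)) < 0.3).astype(int)
folds = greedy_multilabel_folds(Y, n_folds=10)
```

Plain random folds can leave a rare label entirely absent from some fold; preserving per-label proportions avoids evaluating a cell type on folds where it never occurs.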

Declarations
Data Availability JPEG files of all 7,848 images used in the present investigation are available on v19.proteinatlas.org. The manually annotated protein expression in eight different cell types will be available in the upcoming version 20 of the HPA, released in October 2020. Manual errors identified as part of this study have been corrected, which means that some of the presented protein expression data on the HPA will differ from the input data used for model training.
Upon acceptance of the paper, all code will be made available on GitHub. The code will also be available to editors and referees upon request.

Figure legends (fragments)
…features with standard deep learning features. The mean predictive probability and bias-corrected estimated uncertainty were used for generation of the Ghoshal-Tucker-Lindskog (GTL) confidence score, which allowed for dividing the images into those that were reliably predicted by the model and those of high uncertainty that need manual inspection. …showed that Leydig cells more often showed cytoplasmic staining, while Sertoli cells and peritubular cells had the highest proportion of images that were negative/lacked protein expression in these cell types. Confidence maps of all automated predictions for each of the eight cell types: each dot corresponds to one prediction, with green = correct and red = incorrect. The predictions were sorted based on their GTL Score, showing the confidence in prediction. The blue lines depict the determined cut-off for each cell type where classification is considered too unreliable. FUNDC2 displayed weak cytoplasmic positivity in spermatogonia (arrows), but due to strong staining in elongated/late spermatids (white/black arrow), the spermatogonia staining was considered unspecific. Similarly, MCM6 showed weak nuclear staining in pachytene spermatocytes, considered unspecific compared to the strongly positive preleptotene spermatocytes (white/black arrows). (C) The uncharacterized protein KIAA1324 and Spectrin repeat containing nuclear envelope family member 3 (SYNE3) were stained in small structures missed by the HBNet prediction. KIAA1324 showed positivity in small perinuclear structures of round/early spermatids most likely representing centrosomes (arrows). SYNE3 was stained in nuclear membranes of Sertoli cells (arrows). (D)