Basal cell carcinoma is the most common form of skin cancer in humans. The incidence is as high as the incidence of all other cancers combined1. Further, the number of BCC cases is increasing globally2–4. Although metastasis and death are rare, BCCs can cause significant morbidity due to aggressive and destructive local growth5.
BCCs are a heterogeneous group of tumors with different growth patterns. Internationally, BCCs are classified into two broad categories based on histopathologic features: low-risk and high-risk subtypes6. These categories can be further divided into subclasses. Swedish pathologists, for example, classify BCCs according to the “Sabbatsberg model”, which comprises three risk categories: a) “low-aggressive” subtypes, further divided into nodular (type Ia) and superficial (type Ib); b) “medium-aggressive” subtypes (type II), which include less aggressive infiltrative tumors that grow in a more well-defined manner and more superficially than high-aggressive tumors; and c) “high-aggressive” subtypes (type III), which include the more aggressive infiltrative and morpheaform tumors7. Correct assessment of the subtype is crucial for planning the appropriate treatment. However, there is significant inter-pathologist variability when grading tumors8 and reporting the subtype9,10.
Moreover, the time-consuming process of evaluating histological slides, combined with an increasing number of samples, delays diagnosis and increases costs11. To reduce diagnosis time and inter-observer variation, deep learning12 approaches have been actively investigated. Deep learning enables computational image analysis in pathology, with the potential to increase classification accuracy and reduce inter-observer variability13,14. Interestingly, even previously unknown morphological features associated with metastatic risk, disease-free survival, and prognosis may be revealed15,16.
Early computational histology methods required pixel-wise annotations, i.e., pathologists delineating specific regions on whole-slide images (WSIs)17. Pixel-wise annotation, however, is time-consuming, and such approaches do not generalize well to real-world data18. As an alternative, weakly supervised learning has become a widely adopted framework for WSI classification. The most common technique within weakly supervised learning is multi-instance learning (MIL)19, which can use WSI-level labels, i.e., labels not associated with a specific region, without losing performance20. MIL treats the set of instances (the patches of a WSI) as a bag: the presence of even a single positive patch makes the bag, i.e., the WSI, positive; otherwise, the bag is treated as negative. MIL requires that the WSIs be partitioned into sets of patches, often without the need for data curation18.
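As an illustration, the standard MIL assumption can be expressed as max-pooling over per-patch scores. This is a minimal sketch of the general principle, not the specific pipeline used in this work; the instance-level classifier producing the scores is assumed given.

```python
import numpy as np

def mil_bag_label(patch_scores: np.ndarray, threshold: float = 0.5) -> int:
    """Standard MIL assumption: a bag (WSI) is positive if at least
    one instance (patch) is positive.

    patch_scores: per-patch probabilities of the positive class,
    produced by some instance-level classifier (assumed here).
    """
    # Max-pooling over instance scores implements the
    # "at least one positive patch => positive slide" rule.
    bag_score = patch_scores.max()
    return int(bag_score >= threshold)

# Toy example: a single suspicious patch is enough to flag the slide.
print(mil_bag_label(np.array([0.1, 0.2, 0.9])))  # -> 1
print(mil_bag_label(np.array([0.1, 0.2, 0.3])))  # -> 0
```

In practice, the hard max is often replaced by a learned, attention-weighted aggregation over instances, but the bag-level label semantics stay the same.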
Later works have increasingly added a self-supervised contrastive learning paradigm to extract better feature vectors, in which pre-trained CNN models are fine-tuned using a contrastive learning framework21. Adding such components to MIL approaches has been shown to improve performance22,23. However, the MIL framework fundamentally assumes that the patches are independently and identically distributed, neglecting the correlation among instances19,24, which degrades the overall performance of the classification models. Instead, this spatial correlation can be captured using graph neural networks, which in turn increases model performance25–27.
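To illustrate how spatial correlation between patches can be made explicit, patches can be connected into a graph based on their grid coordinates on the slide, and features aggregated over that graph. This is a hypothetical, simplified sketch (no learned weights or nonlinearity), not the construction used by the cited works:

```python
import numpy as np

def build_patch_graph(coords: np.ndarray, radius: float = 1.5) -> np.ndarray:
    """Adjacency matrix connecting patches whose grid coordinates on
    the slide lie within `radius` of each other (illustrative rule)."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    adj = (dist <= radius).astype(float)
    np.fill_diagonal(adj, 0.0)  # no self-loops
    return adj

def gcn_layer(features: np.ndarray, adj: np.ndarray) -> np.ndarray:
    """One simplified graph-convolution step: average each patch's
    features with those of its spatial neighbours."""
    adj_hat = adj + np.eye(len(adj))      # add self-connections
    deg = adj_hat.sum(axis=1, keepdims=True)
    return (adj_hat @ features) / deg     # mean aggregation

# Toy 2x2 grid of patches with 3-dimensional feature vectors.
coords = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
feats = np.random.rand(4, 3)
smoothed = gcn_layer(feats, build_patch_graph(coords))
print(smoothed.shape)  # -> (4, 3)
```

Stacking such neighbourhood-aggregation layers (with learned projections in between) lets each patch representation absorb its spatial context before the bag-level decision is made.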
Recently, Transformers28 have made a great leap on the AI front by introducing the capability to incorporate context across a sequence of tokens in natural language processing tasks, e.g. GPT-329. Inspired by the success of transformers in natural language processing, Dosovitskiy et al.30 proposed the Vision Transformer (ViT), a method for image classification that takes flattened patches of an image as input. This allows the model to treat the patches as a sequence of tokens and to account for their positions (context) using positional embeddings. By incorporating these positional relationships (contextual information), ViT can outperform CNNs, especially when using features obtained from self-supervised contrastive models. In addition, vision transformers require substantially less data and compute than many CNN-based approaches30,31. Further, their relative resilience to noise, blur, artifacts, semantic changes, and out-of-distribution samples may contribute to better performance32.
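The ViT input pipeline described above can be sketched as follows. This is a minimal illustration of patch flattening and positional embeddings; the projection and embedding matrices are random stand-ins for what ViT learns during training:

```python
import numpy as np

def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened non-overlapping patches,
    as in ViT: each patch becomes one token of length patch*patch*C."""
    h, w, c = image.shape
    tokens = []
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            tokens.append(image[i:i + patch, j:j + patch].reshape(-1))
    return np.stack(tokens)

# Toy 8x8 RGB image split into 4x4 patches -> 4 tokens of length 48.
img = np.random.rand(8, 8, 3)
tokens = patchify(img, 4)

# Each token is linearly projected, and a positional embedding is added
# so the transformer knows where each patch sits in the image.
d_model = 16
proj = np.random.rand(tokens.shape[1], d_model)  # stand-in projection
pos = np.random.rand(tokens.shape[0], d_model)   # stand-in embeddings
embedded = tokens @ proj + pos
print(embedded.shape)  # -> (4, 16)
```

The resulting token sequence is what the transformer encoder attends over; without the positional term, the model would see the patches as an unordered set.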
In medical imaging, transformers have been applied to classification, segmentation, detection, reconstruction, enhancement, and registration tasks32. In histology specifically, vision transformers have been successfully applied to a range of tasks, including the detection of breast cancer metastases and the classification of cancer subtypes in lung, kidney, and colorectal cancer33,34. Given the success of vision transformers in many medical applications and the ability of graph neural networks to capture correlations among patches, we combine graph neural networks and Transformers to detect and classify BCCs.