Study approval
This study is governed by the following Institutional Review Board (IRB) approvals: Charité–Universitätsmedizin Berlin, Germany (EA2/190/16); UKB Universitätsklinikum Bonn, Germany (Lfd.Nr.386/17). The authors obtained written informed consent from the patients or their guardians, including permission to publish photographs.
Datasets
We collected images of subjects with clinically or molecularly confirmed diagnoses from the Face2Gene database (https://www.face2gene.com). Extracted, deidentified data were used to remove poor-quality or duplicated images from the dataset without viewing the photos. After removing images of insufficient quality, the dataset consisted of 33,350 images from 21,836 subjects with a total of 1,362 syndromes (Supplementary Table 4).
GestaltMatcher was designed to distinguish syndromes with different properties. We separated syndromes by the number of affected subjects and whether they had already been learned by the DeepGestalt model. Supplementary Figure 7 provides an overview of how the dataset was divided. The current DeepGestalt approach requires at least seven subjects to learn a novel syndrome. We first used this threshold to separate the syndromes into rare and ultra-rare syndromes. We denoted ultra-rare syndromes as “target” syndromes because the objective of our study was to improve phenotypic decision support for these disorders. However, rare syndromes that are not associated with facial dysmorphic features cannot be modeled by DeepGestalt. We therefore further divided rare syndromes into “distinct” (possessing characteristic facial dysmorphism recognized by DeepGestalt) and “non-distinct” (lacking facial dysmorphic features, or with features not recognized by DeepGestalt). The distinct syndromes were used to validate syndrome prediction and the separability of subtypes of a phenotypic series because these syndromes are known to have facial dysmorphic features that are well recognized by the DeepGestalt encoder. We excluded autism from the non-distinct group of syndromes in this study because it had many more subjects than other non-distinct syndromes, leading to an imbalanced dataset. For target syndromes, we sought to demonstrate that GestaltMatcher could predict a syndrome even if facial images were publicly available for only a few subjects. It is noteworthy that, for more than half of all known disease-causing genes, fewer than ten cases with pathogenic variants have been submitted to ClinVar (Figure 1). Of the 1,362 syndromes in the entire dataset, 296 were distinct, 242 non-distinct, and 824 target. DeepGestalt cannot yet be applied to non-distinct and target syndromes.
We further divided each of these three datasets into a gallery and a test set. The gallery is the set of subjects that we intend to match, given a subject from the test set. First, 90% of the subjects with each distinct syndrome were used to train the models, and the remaining 10% were used to validate DeepGestalt training; the 90% then became the distinct gallery, and the 10% were assigned to the distinct test set. For the target and non-distinct datasets, we performed 10-fold cross-validation; in each fold, 90% of the subjects with each syndrome were assigned to the gallery and 10% to the test set.
Matching only within a dataset would not represent a real-world scenario. Therefore, the galleries of the three datasets were later combined into a unified gallery that was used to search for matched patients.
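As a minimal illustration of this per-syndrome 90/10 assignment and the subsequent merge into a unified gallery, the following Python sketch assumes a mapping from each syndrome to its subject identifiers; all function and variable names are hypothetical and not part of our pipeline:

```python
import random

def split_gallery_test(subjects_by_syndrome, test_fraction=0.1, seed=0):
    """Assign roughly 90% of the subjects of each syndrome to the gallery
    and 10% to the test set (illustrative helper, not the original code)."""
    rng = random.Random(seed)
    gallery, test = {}, {}
    for syndrome, subjects in subjects_by_syndrome.items():
        shuffled = list(subjects)
        rng.shuffle(shuffled)
        n_test = max(1, round(len(shuffled) * test_fraction))
        test[syndrome] = shuffled[:n_test]
        gallery[syndrome] = shuffled[n_test:]
    return gallery, test

# The distinct, non-distinct, and target galleries are later combined into
# the unified gallery that is searched for matches:
# unified_gallery = {**distinct_gallery, **non_distinct_gallery, **target_gallery}
```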
DeepGestalt encoder
The preprocessing pipeline of DeepGestalt includes facial point detection, facial alignment (frontalization), and facial region cropping. During inference, each facial region crop is forward-passed through a deep convolutional neural network (DCNN), which ultimately yields the final prediction for the input face image. The DeepGestalt network consists of ten convolutional layers (Conv) with batch normalization (BN) and a rectified linear unit (ReLU) activation to embed the input features. After every Conv-BN-ReLU layer, a max pooling layer is applied to decrease the spatial size while increasing the semantic depth of the representation. The classifier part of the network consists of a fully connected linear layer with dropout (0.5). In this study, we considered the DeepGestalt architecture as an encoder–classification composition, pipelined during inference. We chose the output of the last fully connected layer before the softmax classification as the facial feature representation (facial phenotypic descriptor, FPD), resulting in a vector of size 320. The encoder trained on 296 distinct syndromes was named Enc-DeepGestalt.
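As a rough PyTorch sketch of this encoder–classifier composition (the channel widths, number of blocks, and pooling scheme are simplified assumptions and do not reproduce the proprietary DeepGestalt architecture or its weights), only the 320-dimensional output of the layer before the softmax classifier is kept as the FPD:

```python
import torch.nn as nn

class EncoderSketch(nn.Module):
    """Illustrative Conv-BN-ReLU encoder ending in a 320-d descriptor (FPD).
    Widths and depth are assumptions; fewer blocks are shown than in DeepGestalt."""
    def __init__(self, fpd_dim=320, n_classes=296):
        super().__init__()
        chans = [3, 32, 64, 128, 256, 320]                 # assumed channel widths
        blocks = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            blocks += [nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                       nn.BatchNorm2d(c_out),
                       nn.ReLU(inplace=True),
                       nn.MaxPool2d(2)]                    # halve the spatial size
        self.features = nn.Sequential(*blocks)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fpd = nn.Linear(chans[-1], fpd_dim)           # facial phenotypic descriptor
        self.classifier = nn.Sequential(nn.Dropout(0.5),   # softmax head used for training
                                        nn.Linear(fpd_dim, n_classes))

    def forward(self, x):
        h = self.pool(self.features(x)).flatten(1)
        fpd = self.fpd(h)                                  # 320-d embedding kept for matching
        return fpd, self.classifier(fpd)
```

During GestaltMatcher inference, only the FPD output is used; the classification head is replaced by nearest-neighbor matching in the CFPS, as described below.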
Our first hypothesis was that images of patients with the same molecularly diagnosed syndromes or within the same phenotypic series, and who also share similar facial phenotypes, can be encoded into similar feature vectors under some set of metrics. Moreover, we hypothesized that DeepGestalt’s specific design choice of using a predefined, offline-trained, linear classifier could be replaced by other classification “heads”, for example, k-Nearest Neighbors using cosine distance, which we used for GestaltMatcher.
Descriptor projection: Clinical Face Phenotype Space
Each image was encoded by the DeepGestalt encoder, resulting in a 320-dimensional FPD. These FPDs were further used to form a 320-dimensional space called the Clinical Face Phenotype Space (CFPS), with each FPD a point located in the CFPS, as shown in Figure 2. The similarity between two images is quantified by the cosine distance between them in the CFPS. The smaller the distance, the greater the similarity between the two images. Therefore, clusters of subjects in the CFPS can represent patients with the same syndrome, similarities among different disorders, or the substructure under a phenotypic series.
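A minimal sketch of this distance computation (function and variable names are illustrative):

```python
import numpy as np

def cosine_distance(fpd_a, fpd_b):
    """Cosine distance between two 320-dimensional FPDs in the CFPS:
    1 - cosine similarity; smaller values indicate more similar faces."""
    a = np.asarray(fpd_a, dtype=float)
    b = np.asarray(fpd_b, dtype=float)
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
```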
Evaluation
To evaluate GestaltMatcher, we took the images in the test set as input and positioned them in the CFPS defined by the images of the gallery. We calculated the cosine distance between each of the test set images and all of the gallery images. Then, for each test image, if an image from another subject with the same disorder in the gallery was among the top-k nearest neighbors, we called it a top-k match. We then benchmarked the performance by top-k accuracy (percent of test images with correct matches within the top k). We further compared the accuracy of each syndrome in the distinct, non-distinct, and target syndrome subsets to investigate whether GestaltMatcher can extend DeepGestalt to support more syndromes.
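A sketch of this top-k evaluation, assuming the gallery and test FPDs are stored as NumPy matrices with parallel arrays of syndrome labels and subject identifiers (all names are illustrative):

```python
import numpy as np

def top_k_accuracy(test_fpds, test_syn, test_subj,
                   gallery_fpds, gallery_syn, gallery_subj, k=10):
    """Fraction of test images whose k nearest gallery images (cosine distance)
    include another subject with the same syndrome. Illustrative sketch."""
    g = gallery_fpds / np.linalg.norm(gallery_fpds, axis=1, keepdims=True)
    t = test_fpds / np.linalg.norm(test_fpds, axis=1, keepdims=True)
    dist = 1.0 - t @ g.T                                  # pairwise cosine distances
    hits = 0
    for i in range(len(t)):
        nearest = np.argsort(dist[i])[:k]                 # k nearest gallery images
        hits += any(gallery_syn[j] == test_syn[i] and gallery_subj[j] != test_subj[i]
                    for j in nearest)
    return hits / len(t)
```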
London Medical Dataset validation analysis
We compiled 323 images of patients diagnosed with 90 distinct syndromes from the LMD19 and used this as the validation set for distinct syndromes. We first evaluated the validation set using the softmax classifier, i.e., the DeepGestalt approach. To compare this with GestaltMatcher, we evaluated GestaltMatcher on two different galleries: a gallery of distinct syndromes consisting of 20,091 images of patients with 296 syndromes, and a unified gallery consisting of 27,826 images of patients with 1,362 syndromes. We then reported the top-k accuracy and compared the results of these three conditions (DeepGestalt with softmax, GestaltMatcher with the distinct gallery, and GestaltMatcher with the unified gallery).
Target syndromes analysis
To understand the potential for matching target syndromes, we trained an encoder, denoted Enc-Target, on 477 out of 824 target syndromes with more than three and fewer than seven subjects. Ninety percent of the subjects were used to train Enc-Target and were later assigned to the gallery. The remaining 10% of subjects were assigned to the test set. We then compared the performance of Enc-Target and Enc-DeepGestalt (see previous section) using cosine distance and the softmax classifier.
Syndrome facial distinctiveness score
To evaluate the importance of the facial gestalt for clinical diagnosis of the patient, we asked three dysmorphologists to score the usefulness of each syndrome’s facial gestalt for establishing a diagnosis. Three levels were established:
- Facial gestalt can be supportive in establishing the clinical diagnosis.
- Facial gestalt is important in establishing the clinical diagnosis, but diagnosis cannot be made without additional clinical features.
- Facial gestalt is a cardinal symptom, and a visual or clinical diagnosis is possible based only on the facial phenotype.
We then averaged the grades from the three dysmorphologists for each syndrome.
Syndrome prevalence
The prevalence of each syndrome was collected from Orphanet (www.orpha.net). Birth prevalence was used when the actual prevalence was missing. If only the number of cases or families was available, we estimated the prevalence by summing the numbers of all reported cases or families and dividing by the global population (7.8 billion), assuming a family size of ten for each family29.
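Under the stated assumptions (7.8 billion global population, ten individuals per family), the estimate amounts to the following sketch; the helper name and example counts are hypothetical:

```python
def estimated_prevalence(n_cases=0, n_families=0,
                         family_size=10, world_population=7.8e9):
    """Prevalence estimate when Orphanet reports only case or family counts:
    (cases + families * assumed family size) / global population."""
    return (n_cases + n_families * family_size) / world_population

# e.g. a syndrome reported only as 12 cases and 3 additional families:
# estimated_prevalence(n_cases=12, n_families=3)  ->  (12 + 30) / 7.8e9 ≈ 5.4e-9
```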
Unseen syndromes correlation analysis
To investigate the influence of prevalence and distinctiveness score on performance for novel syndromes with facial dysmorphism, we selected 50 distinct syndromes and held them out of the training set. The 50 syndromes were selected to have evenly distributed distinctiveness scores and prevalences; the distributions are shown in Supplementary Figure 7 and Table 4. The encoder (Enc-unseen) was trained on 90% of the subjects from the other 246 distinct syndromes. In addition, we performed random downsampling to remove the confounding effect of prevalence. In each iteration, we randomly downsampled each syndrome by assigning five subjects to the gallery and one subject to the test set. We then averaged the top-10 accuracy over 100 iterations. We calculated Spearman rank correlation coefficients for two pairs of data: first, between top-10 accuracy and the syndrome's distinctiveness score; and second, between top-10 accuracy and the prevalence of the syndrome collected from Orphanet.
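The correlation step can be sketched with SciPy's spearmanr; the per-syndrome arrays below are placeholders to illustrate the call, not our results:

```python
import numpy as np
from scipy.stats import spearmanr

# Placeholder values, one entry per held-out syndrome (illustrative only):
top10_acc = np.array([0.62, 0.48, 0.91, 0.35, 0.77])     # averaged over 100 downsampling runs
distinct_score = np.array([2.3, 1.7, 3.0, 1.3, 2.7])     # mean grade of three dysmorphologists
prevalence = np.array([1e-6, 4e-7, 2e-5, 8e-8, 5e-6])    # collected from Orphanet

rho_score, p_score = spearmanr(top10_acc, distinct_score)  # accuracy vs. distinctiveness
rho_prev, p_prev = spearmanr(top10_acc, prevalence)        # accuracy vs. prevalence
```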
Analysis of number of training syndromes
In this analysis, we trained the encoders with different numbers of syndromes. We first sorted the syndromes by the number of subjects in each syndrome, in descending order. We then trained 13 encoders, each with a different number of training syndromes. We used the ten most common syndromes in the training set for the first encoder. For the second encoder, we trained on the top 30 syndromes, and continually increased the number of syndromes for each subsequent encoder by 20 until we reached 246 syndromes. Thus, we simulated how syndromes would be included in model training in the real world. We took the 50 selected distinct syndromes as the test set and performed random downsampling as described in the previous section; the only difference was that we used encoders trained from ten to 246 syndromes.
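A sketch of how these nested training sets could be derived (the helper and its input mapping are hypothetical):

```python
def training_syndrome_sets(syndrome_counts, total=246):
    """Syndromes sorted by subject count (descending), sliced into the 13 nested
    training sets of size 10, 30, 50, ..., 230, and 246. Illustrative helper."""
    ranked = sorted(syndrome_counts, key=syndrome_counts.get, reverse=True)
    sizes = list(range(10, 231, 20)) + [total]             # 10, 30, ..., 230, 246
    return {n: ranked[:n] for n in sizes}
```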
GeneMatcher validation analysis
We selected 14 publications in which GeneMatcher was used to match patients with facial dysmorphism from unrelated families. In total, these studies contained 104 photos of 89 subjects from 77 families. The details are shown in Table 3. We performed leave-one-out cross-validation on this dataset, i.e., we kept one photo as the test set, and we assigned the rest of the photos to a gallery of 3,636 photos with 824 target syndromes to simulate the distribution of patients with unknown diagnosis. We then evaluated the performance by top-1 to top-30 rank. If a photo of another subject with the same disease-causing gene from an unrelated family was among the top-k rank, we called it a match.
Moreover, we used the top-k rank to measure how many unrelated families were connected. If a photo from an unrelated family was among the top-k ranks for a test photo, the two families were considered connected at that rank. We also reported how many families were matched to at least one unrelated family.
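A sketch of the per-photo matching step, assuming precomputed cosine distances from the held-out photo to all gallery photos, with parallel arrays of family and disease-gene labels (all names are illustrative):

```python
import numpy as np

def connected_families(dist_to_gallery, gallery_family, gallery_gene,
                       test_family, test_gene, k=10):
    """Unrelated families connected to the test photo at rank k: gallery photos among
    the k nearest neighbors that share the disease gene but come from another family."""
    nearest = np.argsort(dist_to_gallery)[:k]              # k nearest gallery photos
    return {gallery_family[j] for j in nearest
            if gallery_gene[j] == test_gene and gallery_family[j] != test_family}
```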
Code availability
GestaltMatcher is a partially proprietary framework. While the source code for face cropping cannot be shared, the architecture of the CNN, as well as a web service running the trained version of the tool, is accessible for use by healthcare professionals free of charge at www.gestaltmatcher.org.
Data availability
The data that support the findings of this study are divided into two groups: published data and restricted data. Published data are available from the reported references and also from www.gestaltmatcher.org. Restricted data are curated from Face2Gene users under a license and cannot be published, to protect patient privacy.