The search identified 765 articles in PubMed and 1311 articles in Cochrane (Figure 2), for a total of 2076 articles. After an initial screening for eligibility, 62 articles were selected. After reviewing the title and abstract, 21 of the 62 articles were considered relevant. Lastly, we checked whether the articles matched our CDSS definition and excluded five articles. A total of 16 articles describing 13 different CDSS were used for data extraction and analysis. In the next section we describe the results according to our selection criteria. The CDSS are then presented in more detail by the category “functionality”.
3.1 Functionality, development status, type of clinical data, system availability
Six of the thirteen CDSS included in the analysis use phenotypic and genetic matching as a CDSS functionality [16–21], and three of these systems additionally make use of clinical data [18, 20, 21]. Five of the CDSS are based on machine learning and information retrieval [22–26] and two use web search methods [27, 28]. Six systems are clinical prototypes [22–24, 26, 28] and seven are fully developed [16–21]. Regarding “Type of clinical data”, two systems use literature databases [27, 28], two use clinical data [24, 25], and one system uses both clinical and literature data [26]. Two CDSS use patient questionnaires [22, 23]. Regarding the category “System availability”, eight systems are available for use via online access or download [16–21, 27], while five systems are not publicly available [22–26]. Table 3 gives a comparative overview of these results.
3.1.1 Functionality: Machine Learning and information retrieval
During a training phase, machine learning systems learn to predict or classify a problem based on existing data, such as determining whether a certain disease is present or not. After the training phase, the system can make a prediction for a new dataset [29]. Information retrieval is defined as searching a resource to extract a required piece of information. Machine learning techniques can be used in information retrieval [30].
In this review, we identified five CDSS using machine learning and information retrieval. All articles describe the CDSS as clinical prototypes that are not publicly available. All articles except the one by Shen et al. [26] focus on specific RD or groups of RD [22–25], including neuromuscular diseases [22], paediatric pulmonary diseases [23], rare cancer types [25] and various rare genetic diseases [24].
Rother et al. [23] and Grigull et al. [22] used medical questionnaires completed by patients with RD to train machine learning algorithms such as support vector machines, random forests or k-nearest neighbors. In both studies the machine learning algorithms were compared with a fusion algorithm, which is a combination of different algorithms. Both studies used patient-oriented questions to develop a questionnaire with 46 items. In the study by Grigull et al. [22], 210 patients answered the questions and a diagnosis rate of 89 % was achieved in a one-year prospective study. In the study by Rother et al. [23], the number of training cases was lower at 170, but a higher diagnosis rate of 94 % was achieved [22, 23].
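The idea of a fusion algorithm can be illustrated with a minimal sketch. It assumes that fusion means a simple majority vote, and it combines three k-nearest-neighbor classifiers with different values of k instead of the support vector machines and random forests named above. The toy questionnaire data, the labels and the function names `knn_predict` and `fuse` are hypothetical and are not taken from the cited studies.

```python
from collections import Counter
from math import dist


def knn_predict(train, query, k=3):
    """Classify `query` by majority label among its k nearest training cases."""
    nearest = sorted(train, key=lambda case: dist(case[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def fuse(predictions):
    """Majority vote across the labels proposed by several classifiers."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical toy data: 4-item questionnaires, answers coded 0..3.
train = [
    ((3, 3, 0, 1), "disease_A"),
    ((3, 2, 0, 0), "disease_A"),
    ((2, 3, 1, 0), "disease_A"),
    ((0, 1, 3, 3), "disease_B"),
    ((1, 0, 3, 2), "disease_B"),
    ((0, 0, 2, 3), "disease_B"),
]
new_patient = (3, 3, 1, 0)

# Three classifiers (here: k-NN with different k) vote on the diagnosis.
votes = [knn_predict(train, new_patient, k=k) for k in (1, 3, 5)]
diagnosis = fuse(votes)
print(votes, diagnosis)
```

A majority vote is only one possible way to combine classifiers; weighted votes or stacked models would follow the same pattern.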
In contrast to the previous studies, Sidiropoulos et al. [25] and Garcelon et al. [24] use clinical data in their CDSS. Sidiropoulos et al. [25] developed a real-time decision support system for the diagnosis of rare cancer types based on histological clinical data and machine learning. Garcelon et al. used the Vector Space Model (VSM), which falls into the “information retrieval” category. The authors explicitly decided against machine learning methods, since a large amount of training data would be required, which in most cases is a challenge in the field of RD [24].
Sidiropoulos et al. used quantitative histological descriptors combining structural, textural and morphological information. For real-time decision support, a GPU (Graphics Processing Unit) framework was used to show a result in real time whenever a new patient case is registered. A probabilistic neural network was trained on 140 rare brain cancer cases to predict the malignancy of the tumor. The focus was therefore not on predicting the actual disease, as in the other articles, but on determining malignancy based on the WHO guideline for classification of neuroepithelial tumors of the central nervous system. The system achieved an accuracy of about 74 % and performed 267 to 288 times faster on the GPU-based system than on the CPU-based (Central Processing Unit) system [25].
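For readers unfamiliar with probabilistic neural networks, the following minimal sketch shows the general technique: each training case contributes a Gaussian kernel value, the values are averaged per class, and the class with the highest average is chosen. The two-dimensional features, the class labels, the smoothing parameter `sigma` and the function name `pnn_predict` are assumptions for illustration; the sketch does not reproduce the descriptors or the GPU implementation of the cited study.

```python
from math import exp


def pnn_predict(train, query, sigma=1.0):
    """Return (predicted_label, class_scores) for a probabilistic neural network."""
    sums, counts = {}, {}
    for features, label in train:
        # Pattern layer: Gaussian kernel between the query and one training case.
        sq_dist = sum((a - b) ** 2 for a, b in zip(features, query))
        kernel = exp(-sq_dist / (2 * sigma ** 2))
        sums[label] = sums.get(label, 0.0) + kernel
        counts[label] = counts.get(label, 0) + 1
    # Summation layer: average kernel response per class.
    scores = {label: sums[label] / counts[label] for label in sums}
    # Decision layer: pick the class with the highest score.
    return max(scores, key=scores.get), scores


# Hypothetical descriptors scaled to [0, 1]; not real histological data.
train = [
    ((0.2, 0.1), "low_grade"),
    ((0.3, 0.2), "low_grade"),
    ((0.8, 0.9), "high_grade"),
    ((0.9, 0.7), "high_grade"),
]
label, scores = pnn_predict(train, (0.85, 0.8), sigma=0.2)
print(label, scores)
```

Because every training case is evaluated independently, the pattern layer parallelizes well, which is one plausible reason such a model benefits from GPU execution.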
Garcelon et al. [24] developed a system to find patients similar to an undiagnosed patient (index patient). The data is based on a clinical data warehouse containing about 400,000 patients. The similarity is calculated using the Vector Space Model (VSM), an information retrieval technique that computes the similarity between documents represented as vectors of keywords. In this case, patient data is represented as a vector and the distance between the index patient and all other patients is calculated [24]. The evaluation of the approach was based on five different rare genetic diseases with 7 to 103 patient cases per disease. The authors evaluated the ability to find the top five patients matching the index patient as closely as possible. The percentages of index patients returning at least one true positive similar patient were reported as 94 % for Lowe Syndrome, 97 % for Epidermolysis Bullosa, 86 % for Activated PI3K Delta Syndrome, 71 % for Dowling Meara and 99 % for Rett syndrome. The processing time to retrieve similar patients in the evaluation was about 12 seconds [24]. Only Sidiropoulos et al. [25] and Garcelon et al. [24] considered the processing time for their CDSS.
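The vector-based comparison can be sketched as follows. The example assumes plain keyword counts and cosine similarity, which is a common choice for the VSM; the weighting scheme actually used in [24] may differ. The patient identifiers, keywords and the function names `cosine` and `top_similar` are hypothetical.

```python
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two keyword-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_similar(index_vec, patients, k=5):
    """Return the k patients most similar to the index patient, best first."""
    scored = [(cosine(index_vec, vec), pid) for pid, vec in patients.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(pid, score) for score, pid in scored[:k]]


# Hypothetical keyword vectors extracted from patient records.
warehouse = {
    "P1": Counter(["cataract", "hypotonia", "proteinuria", "rickets"]),
    "P2": Counter(["seizure", "microcephaly"]),
    "P3": Counter(["cataract", "seizure"]),
}
index_patient = Counter(["cataract", "hypotonia", "proteinuria"])

top = top_similar(index_patient, warehouse, k=2)
print(top)
```

No training phase is required here: adding a patient only means adding one more vector, which fits the authors' argument that training data is scarce for RD.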
The approaches mentioned so far only focus on clinical data. In contrast, Shen et al. developed a system that combines clinical and literature data. The clinical data used by the authors includes 13 million unstructured clinical notes for over 700,000 patients, limited to the described problems and diagnoses. The literature dataset comprises about 91,000 phenotype-rare disease associations, which were extracted from research articles in SemMedDB using HPO (Human Phenotype Ontology) and GARD (Genetic and Rare Diseases Information Center) terms [26]. SemMedDB is a repository of semantic predications extracted from the titles and abstracts of all PubMed citations, whereas GARD is a database that contains information about RD based on 4560 diseases and 32 disease categories. The HPO is the most widely used ontology describing phenotypes for genetic diseases [20]. The system developed in this study was able to combine these heterogeneous data sources into a collaborative filtering model for RD recommendation. In conclusion, the authors reported that the combination of electronic medical records and literature did not always lead to the best performance. This may be due to approaches and expressions in medical documentation varying from physician to physician [26].
3.1.2 Functionality: Web search
For complex and difficult patient cases, clinicians often consult peer-reviewed patient cases in journals, mostly case reports, to find patients with similar characteristics. This process is time-consuming and inefficient, and tools to find and compare these reports would be helpful. In this review, we identified two fully developed CDSS adopting this idea for patients with RD: FindZebra [27] and a system by Taboada et al. [28].
When searching PubMed it is often difficult to identify patients with similar characteristics. Publications on related diseases are often not marked as case reports and the number of published cases in RD is limited. Dragusin et al. [27] developed FindZebra, a search engine for RD. The authors designed their tool to resemble other search engines in order to allow intuitive use. The search engine is based on “Indri”, an open source information retrieval system. The knowledge base of FindZebra is built on 33,144 documents of different medical cases covering 90 % of the Orphanet database. The main sources for the dataset are OMIM (Online Mendelian Inheritance in Man), GARD, Orphanet, Wikipedia and NORD (National Organization for Rare Disorders). Taboada et al. [28], on the other hand, used the Human Phenotype Ontology, the National Center for Biomedical Ontology (NCBO) and the Open Biological Ontologies (OBO). Their tool uses so-called “text annotation” with the mentioned ontologies to identify phenotypes in PubMed abstracts. This is different from FindZebra, where the symptoms can be entered as free text. The tool by Taboada et al. requires more effort from the user: in order to find corresponding phenotypes, the text must be copied into the program or read in as a file. In contrast, FindZebra provides matching results from the mentioned databases based on the symptoms entered, which are displayed on one web page.
FindZebra uses a larger portfolio of data with 10 different data sources. While FindZebra can be accessed directly online [27], the use of the search engine by Taboada et al. [28] requires a download.
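To make the notion of “text annotation” more concrete, the following sketch marks phenotype mentions in an abstract using a small dictionary lookup. It is a deliberately simplified stand-in: the lexicon contains only three illustrative HPO terms, the abstract is invented, and the function name `annotate` is hypothetical. The ontology-based annotation used by Taboada et al. [28] is considerably more elaborate.

```python
import re

# Illustrative mini-lexicon mapping surface forms to HPO identifiers.
HPO_LEXICON = {
    "seizure": "HP:0001250",
    "seizures": "HP:0001250",
    "microcephaly": "HP:0000252",
    "cataract": "HP:0000518",
    "cataracts": "HP:0000518",
}


def annotate(text):
    """Return sorted (HPO id, matched phrase) pairs found in `text`."""
    hits = set()
    lowered = text.lower()
    for phrase, hpo_id in HPO_LEXICON.items():
        # Word boundaries prevent matches inside longer words.
        if re.search(r"\b" + re.escape(phrase) + r"\b", lowered):
            hits.add((hpo_id, phrase))
    return sorted(hits)


# Invented example sentence, not taken from a real abstract.
abstract = (
    "We report a boy with congenital cataracts, microcephaly "
    "and recurrent seizures."
)
annotations = annotate(abstract)
print(annotations)
```

Mapping free text to ontology identifiers in this way is what allows abstracts that use different wording to be compared on the same set of phenotype terms.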
3.1.3 Functionality: Genetic and phenotypic matching
A promising method when sharing patient cases is the comparison of exomes, genomes or phenotype-related patient data. Especially in RD, where 80 % of the diseases are of genetic origin, it is important to identify the external manifestation of these disorders (phenotypes) in combination with genetic testing to determine the cause of the disease (genotype) [19]. Several software systems have been developed to implement this idea. In this review, we identified the projects GeneYenta [16], GeneMatcher [17], GenIO [18], DECIPHER [19], PhenomeCentral [20] and Matchmaker Exchange [21], which we refer to as “matching tools” below. In the following, we show the differences between the tools based on the listed criteria:
- (a): Available for usage: The tools are available and can be used by clinicians.
- (b): Registration necessary: A registration is required to use the platform.
- (c): Gene identification: Comparison between patients can be made on the basis of gene data.
- (d): Diagnosis code: A diagnosis code for suspected diseases can also be added for the comparison between patients.
- (e): Phenotypic Terms (HPO): Phenotypic terms are entered using the HPO nomenclature.
- (f): Acceptance of VCF (Variant Call Format) Files: Gene variants can be uploaded in the VCF format, a text file format for storing gene sequence variations.
- (g): Provides match score output: A match score for each similar patient is computed.
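Criterion (f) refers to VCF, a tab-separated text format whose data lines begin with the fixed columns CHROM, POS, ID, REF and ALT, preceded by header lines starting with “#”. The following minimal sketch reads these columns only; the two variants are invented, and `Variant` and `parse_vcf` are hypothetical names that do not correspond to any of the reviewed tools.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: list


def parse_vcf(text):
    """Extract chromosome, position, reference and alternate alleles per variant."""
    variants = []
    for line in text.splitlines():
        # Skip empty lines, meta-information (##...) and the column header (#CHROM...).
        if not line or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alt = line.split("\t")[:5]
        # Several alternate alleles are separated by commas.
        variants.append(Variant(chrom, int(pos), ref, alt.split(",")))
    return variants


# Invented example content; not real patient data.
EXAMPLE_VCF = "\n".join([
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t12345\t.\tA\tG\t50\tPASS\t.",
    "X\t67890\t.\tC\tT,G\t99\tPASS\t.",
])
variants = parse_vcf(EXAMPLE_VCF)
print(variants)
```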
Table 4 shows a comparative overview of the tools with regard to the criteria above. We also highlight the aspect of data privacy, which plays a major role in the sharing of patient data.
All matching tools [16–21] are web-based and accessible online. They address the task of finding similar patients based on genetic or phenotypic data in the databases. User registration is required on almost all platforms, except for GenIO [18], where only an e-mail address is required to upload the data directly [18]. Furthermore, almost all matching tools support gene identification, with the exception of GeneYenta [16], which focuses only on the comparison of phenotypes. To describe these phenotypes, all matching tools use the HPO. For genetic data, PhenomeCentral [20] and GenIO [18] make it possible to enter these data in the VCF file format. GenIO [18], GeneMatcher [17] and PhenomeCentral [20] allow suspected diseases to be entered as an additional search criterion. Each matching tool [16, 17, 19], except for GenIO [18], shows the user a match score output. For instance, GeneYenta shows a match score from 0 % to 100 % representing the similarity of phenotypic characteristics [16]. Since GenIO does not compare data from multiple patients, such a score is not possible. In GenIO, genetic data is entered via VCF files together with patient phenotypes using the HPO and OMIM. To process the data, GenIO uses the so-called “GenIO pipeline”, which consists of variant annotation and phenotype processing [18]. The variant annotation uses different tools such as Annovar, Anntools and SnpEff to annotate the variants of a patient. Major clinical genomic databases such as ClinVar, OMIM, the Genome Aggregation Database (gnomAD) and dbSNP (Single Nucleotide Polymorphism Database) are used as information sources for annotation [31–34]. The phenotype processing of GenIO is performed with Phenolyzer, which provides the list of genes related to the patient’s disease/phenotype [35]. GenIO does not involve any sharing of patient cases, because it is a standalone application that uses different clinical genomic databases [18].
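A match score between 0 % and 100 % can be illustrated with a simple set-overlap measure. The sketch below assumes the Jaccard index over HPO term sets as the similarity measure; this is a stand-in chosen for brevity and is not the scoring method of GeneYenta or any other reviewed tool. The case identifiers and the function name `match_score` are hypothetical.

```python
def match_score(terms_a, terms_b):
    """Percentage overlap (Jaccard index) between two sets of HPO terms."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 0.0
    return 100.0 * len(a & b) / len(a | b)


# Hypothetical phenotype profiles described with HPO identifiers.
patient = {"HP:0001250", "HP:0000252", "HP:0000518"}
candidates = {
    "case_1": {"HP:0001250", "HP:0000252"},
    "case_2": {"HP:0000518"},
    "case_3": {"HP:0001263"},
}

# Rank candidate cases by their score, best match first.
ranking = sorted(
    ((match_score(patient, terms), case_id) for case_id, terms in candidates.items()),
    reverse=True,
)
for score, case_id in ranking:
    print(f"{case_id}: {score:.0f} %")
```

A pure set overlap treats all terms as equally informative and ignores the HPO hierarchy; measures that take term specificity into account would rank rare, specific phenotypes higher.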
Bringing the data together: The Matchmaker Exchange Project
All matching tools [16–21] are part of the Matchmaker Exchange Project (MME), which connects organizations and projects through a federated network of databases of genotypes and rare phenotypes using a common application programming interface (API) [21]. The MME enables searches across multiple databases from different platforms by sending requests to all databases. To find similar matches, each request can include gene or genotype data in combination with conditions or phenotype features. Because the network is federated, the databases remain distributed and are connected only through the API, so each database can use its own data model and evolve at its own pace [21].
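The federated search principle can be sketched without any network access. In the following example each database is simulated by an in-memory function that answers a match request on its own records, and a dispatcher forwards the same request to all of them. The request fields (`genes`, `features`), the node names, the gene symbols and the functions `make_node` and `federated_match` are hypothetical and are not taken from the MME API specification.

```python
def make_node(name, records):
    """Build an in-memory stand-in for one database that answers match requests."""
    def match(request):
        wanted_genes = set(request.get("genes", []))
        wanted_features = set(request.get("features", []))
        results = []
        for record in records:
            shared_genes = wanted_genes & set(record["genes"])
            shared_features = wanted_features & set(record["features"])
            if shared_genes or shared_features:
                results.append({
                    "node": name,
                    "id": record["id"],
                    "shared": len(shared_genes) + len(shared_features),
                })
        return results
    return match


def federated_match(request, nodes):
    """Send the same request to every node and merge the answers, best first."""
    results = []
    for node in nodes:
        results.extend(node(request))
    return sorted(results, key=lambda r: r["shared"], reverse=True)


# Two hypothetical databases, each keeping its own records.
node_a = make_node("node_a", [
    {"id": "a1", "genes": ["GENE1"], "features": ["HP:0001250"]},
])
node_b = make_node("node_b", [
    {"id": "b1", "genes": ["GENE2"], "features": ["HP:0001250", "HP:0000252"]},
    {"id": "b2", "genes": ["GENE3"], "features": ["HP:0000518"]},
])

request = {"genes": ["GENE1"], "features": ["HP:0001250", "HP:0000252"]}
matches = federated_match(request, [node_a, node_b])
print(matches)
```

The dispatcher never sees the internal storage of a node, only its answers, which mirrors the point that each database can keep its own data model.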
Data privacy
In the following section, we describe data privacy issues and solutions for all matching tools described here. We consider how the data is shared with third parties and how access to the data is managed. PhenomeCentral, DECIPHER and Matchmaker Exchange include concepts for data visibility based on different levels [19–21]. For instance, patient records in PhenomeCentral have three visibility settings: “private”, “matchable” and “public”. If “private” is selected, data is only visible to the submitter and is not available for matchmaking. For “matchable”, the submitter can see other similar patients and other users can retrieve this patient’s data, i.e. the submitter's own dataset is used for matching. However, detailed genomic and phenotype information is not visible. More precisely, phenotypes become generalized and are shown on gene-level only. The third level, “public”, is more open. The patient record is visible to all registered users and available for matching. Similar patients are shown to the submitters, and phenotypes and genomic variants are visible [20].
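The three visibility levels can be expressed as a small access rule. The sketch below is an interpretation of the description above, not PhenomeCentral code: the record layout, the user names and the functions `available_for_matching` and `view_for` are hypothetical, and the generalization of phenotypes for “matchable” records is reduced to returning gene-level information only.

```python
from enum import Enum


class Visibility(Enum):
    PRIVATE = "private"
    MATCHABLE = "matchable"
    PUBLIC = "public"


def available_for_matching(record):
    """Private records are excluded from matchmaking."""
    return record["visibility"] in (Visibility.MATCHABLE, Visibility.PUBLIC)


def view_for(record, viewer):
    """Return the part of `record` that `viewer` may see, or None."""
    if viewer == record["submitter"] or record["visibility"] is Visibility.PUBLIC:
        return {"phenotypes": record["phenotypes"], "variants": record["variants"]}
    if record["visibility"] is Visibility.MATCHABLE:
        # Detailed phenotypes and variants stay hidden; only gene-level data is shown.
        return {"genes": record["genes"]}
    return None


# Hypothetical record with invented content.
record = {
    "submitter": "dr_a",
    "visibility": Visibility.MATCHABLE,
    "phenotypes": ["HP:0001250"],
    "variants": ["1:12345A>G"],
    "genes": ["GENE1"],
}
print(view_for(record, "dr_b"))
print(view_for(record, "dr_a"))
```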
GeneYenta has fewer security requirements. It only allows HPO-based phenotype data to be stored and does not include any patient-identifying data. The idea is that data sharing is performed outside the platform and according to the rules of the respective institutions [16, 20]. This concept is similar to GeneMatcher, where no identifiable patient data is provided to other clinicians. Data submitters have full control over their data (e.g. gene name, phenotypic features) and can delete or edit it at any time. Users only see their own data. Further levels of data sharing are not described [17].