A series of screening operations was applied to the collected articles in order to identify the most relevant set of articles for this review. The first level of screening was conducted manually on a total of 427 files using the file names and article titles, which resulted in 397 items being selected out of the 427 articles. Importing the selected files into an EndNote library resulted in the automatic removal of 25 articles that were duplicate files from different folders; a subsequent automatic duplicate detection step left a library of 371 articles. In further screening using 'Rayyan', 4 duplicate articles were detected in the library and two of them were removed, leaving 369 articles. Using the same online tool, 90 articles related to the topic of this study were then selected based on title and abstract analysis. Additional screening was performed to identify articles relevant to the study area, and 18 of the 90 related articles were retained. Finally, 9 articles were selected for the final analysis. The overall article selection procedure is outlined in the PRISMA flow chart depicted in Fig. 1 below.
4.1. Distribution of Articles
By applying the specified search methods to the seven databases, 427 articles published between 2014 and 2024 were collected, as shown in Fig. 2 below. Google Scholar was the primary tool and allowed us to collect 178 articles from different sources, including IEEE Xplore, MDPI, Mendeley, Nature, PubMed, ScienceDirect, AJOL, IDP, NCBI, PLOS, Springer, and Tropical Medicine and Health.
Furthermore, the sources of the articles were analyzed with respect to the first two levels of screening, as shown in Fig. 3 below.
4.2. Distribution of Articles by Publication Year
After three levels of screening, 90 articles with a direct relationship to the current systematic review were selected for further screening based on full-text reading and analysis. The selected articles and the distribution of their publication years are shown in Fig. 4 below.
As shown, the articles used for this systematic review were mostly recent publications: 31% were published in 2023, 25% in 2022, 16% in 2021, 14% in 2020, and the remaining 14% between 2014 and 2019.
4.3. Distribution of Articles by Methods Used
Finally, the 90 articles were further analyzed by categorizing them into four groups: (i) articles that utilized ML and DL methods for the diagnosis of skin diseases, (ii) articles that implemented ML and DL techniques for the diagnosis of NTDs, (iii) articles on multimodal data fusion techniques for healthcare data, and (iv) articles that implemented DL-based multimodal data fusion methods for the diagnosis of skin diseases, as shown in Fig. 5 below. As portrayed in Fig. 5, 54.44% of the articles utilized ML and DL methods for the diagnosis of skin diseases in general, 20% dealt with multimodal data fusion techniques for healthcare systems, and 20% implemented DL-based multimodal data fusion methods for the diagnosis of skin diseases. The remaining 5.56% of the articles utilized ML and DL methods for the diagnosis of NTDs in general. However, no article was found that implements DL-based MMDF methods for the diagnosis of NTDs, which led to the analysis of previous studies that used this approach for the diagnosis of skin diseases other than NTDs. Through the fourth level of screening, 18 articles that utilize different fusion techniques for the diagnosis of various skin diseases were identified.
4.4. Analysis of Fusion Techniques Used
The final screening resulted in the separation of 7 of the 18 articles based on the fusion techniques they utilize for the diagnosis of skin diseases. The fusion techniques presented in those 7 studies are feature fusion (5 studies), image fusion (1 study), and model fusion (1 review study), as presented in Table 1 below. Table 1 presents the analysis of these three types of fusion, other than MMDF, using five different parameters.
In addition, 2 articles [12][13] presented reviews of multimodal data fusion techniques for the diagnosis of skin diseases other than NTDs. Although these 2 articles did not implement MMDF techniques for a specific skin disease diagnosis on datasets of their choice, they presented theoretical analyses. All in all, 9 articles were used for the final analysis of this review.
Table 1
Review of the feature fusion and related techniques for skin disease diagnosis
Ref | Pub. Yr. | Study Method / Approach Used | Disease(s) Selected | Dataset(s) Used | Algorithm(s) Used | Performance Results Achieved |
[31] | 2019 | Transfer Learning and multi-layer feature fusion network | Skin Lesion | HAM10000 dataset | CNN | high recognition (ROC-AUC 96.51) |
[24] | 2021 | Image fusion (clinical & dermoscopic): multi-labeled deep feature extractor and clinically constrained classifier chain (CC) | Skin Cancer (Melanoma) | publicly available 7-point checklist dataset | DCNN, CC, PCA | Reported 81.3% accuracy |
[6] | 2022 | Multiclass skin lesion classification using feature fusion & extreme learning machine (ELM) | Skin Disease (Skin Lesion) | HAM10000 and ISIC2018 | SVM, fine KNN, DT, NB, ensemble tree (EBT), & single hidden layer ELM | Registered a best accuracy of 94.36%
[32] | 2022 | Applies feature fusion to manually and automatically extracted features | Skin Cancer | DermIS dataset | CNN, LSTM, LBP, Inception V3 | Achieved maximum accuracy of 99.4%
[33] | 2023 | Dual-branch (feature) fusion network using DCNN and Transformer branches for local and global feature extraction | Skin Disease (Skin Lesion) | Used a private dataset XJUSL | DCNN | Reducing parameters by 11.17 M improved classification accuracy by 1.08% |
[34] | 2023 | Feature fusion: fast bounding box (FBB), hybrid feature extractor (HFE), and a VGG19-based CNN | Skin Cancer (Melanoma) | ISIC 2017, Academic torrents dataset | CNN | Registered 99.85% accuracy
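The feature-fusion pipelines summarized in Table 1 generally extract deep features with one or more CNN backbones, concatenate them (sometimes with handcrafted descriptors), and pass the fused vector to a conventional classifier. The following is a minimal sketch of that pattern; the ResNet-18 backbone, the placeholder second feature set, and the SVM classifier are illustrative assumptions, not the exact pipelines of the cited studies.

```python
# Illustrative sketch of deep feature fusion followed by a classical classifier,
# in the spirit of the Table 1 studies. Backbone, feature sets, and classifier
# are assumptions for illustration only.
import torch
import torch.nn as nn
from torchvision import models
from sklearn.svm import SVC

def build_extractor(backbone: nn.Module) -> nn.Module:
    backbone.fc = nn.Identity()   # drop the ImageNet classification head
    backbone.eval()
    return backbone

# Pass pre-trained weights (e.g., ResNet18_Weights.DEFAULT) in practice.
resnet = build_extractor(models.resnet18(weights=None))

# Dummy batch standing in for preprocessed skin-lesion images.
images = torch.randn(8, 3, 224, 224)
with torch.no_grad():
    deep_feats = resnet(images)                     # (8, 512) deep CNN features

# A second feature set (handcrafted or from another CNN); random placeholder here.
other_feats = torch.randn(8, 64)

# Feature-level fusion: concatenate the two feature vectors per sample.
fused = torch.cat([deep_feats, other_feats], dim=1).numpy()   # (8, 576)

labels = [0, 1, 0, 1, 0, 1, 0, 1]                   # placeholder class labels
clf = SVC(kernel="rbf").fit(fused, labels)          # classical classifier on fused features
print(clf.predict(fused[:2]))
```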
After the final screening procedures, 9 articles were selected for the final analysis of this systematic review, as presented in Table 2 below. The 9 selected articles utilized DL-based methods with MMDF techniques for the diagnosis of different skin diseases other than NTDs. These studies were selected for the final analysis because no similar studies were found that diagnose skin-related NTDs using MMDF. Since skin-related NTDs are diagnosed using skin photographs or images, patient records, and related information, these studies were selected and reviewed to analyze the techniques they employ. The final analysis was conducted on the 9 articles using 5 analysis criteria (the methods used, the diseases selected for diagnosis, the datasets used, the algorithms used, and the corresponding performance achievements) to identify research gaps, as summarized in Table 2.
Table 2
Summary of the review of the DL-based multimodal data fusion techniques for the diagnosis of skin diseases
Ref. | Study Method / Approach Used | Algorithm(s) Used | Performance / Accuracy Results Achieved |
[7] | Combining image and metadata features | CNN (5 pre-trained models) | Performs better than the other combination approaches in 6 out of 10 scenarios.
[8] | A naive combination of patient data and an image classifier | CNN | CNN: AUROC of 92.30% ±0.23% and balanced accuracy of 83.17% ±0.38%; naive strategy improved accuracy to 86.72% ±0.36%.
[9] | A DNN-based multimodal classifier using wound images and their locations | AlexNet + MLP, AlexNet + LSTM, ResNet50 + MLP, VGG16 + LSTM | Max. accuracy on mixed classes varies from 82.48% to 100%; max. accuracy on wound classes varies from 72.95% to 97.12% across experiments.
[35] | Two imaging modalities combined with patient metadata | CNN, RF classifier, ResNet-50, ILSVRC | Binary melanoma detection (AUC 0.866 vs 0.784) and multiclass classification (mAP 0.729 vs 0.598)
[36] | Multiplication-based data fusion using metadata | CNN, the color constancy algorithm | Outperforms traditional baseline approaches (p-values < 0.05)
[37] | A DNN with two encoders and a multimodal fusion module | CNN (ResNet-50) | ACC (0.768 ± 0.022), BACC (0.775 ± 0.022); outperforms other metadata fusion methods (MetaNet (P = 0.035) and MetaBlock (P = 0.028))
[38] | Multimodal Transformer using the Vision Transformer (ViT) model | CNN (ResNet101, DenseNet121) and ViT models | Private dataset: accuracy of 0.816 (better than other popular networks); ISIC 2018 dataset: accuracy of 0.9381 and AUC of 0.99
[39] | Preprocessing, feature extraction, and classification/diagnosis | CNN: 6 pre-trained CNN models with tuning algorithms | Average accuracy, sensitivity, specificity, precision, and Dice similarity coefficient (DSC) of about 99.94%, 91.48%, 98.82%, 97.01%, and 94.00%, respectively
[40] | Fusion of clinical skin images and patient clinical data, with feature extraction and attention mechanisms | CNN (VGGNet19, ResNet50, DenseNet121, and Inception-V3) | Achieved an accuracy of 80.42% (an improvement of about 9% over the model using only medical images)
4.5. Methods Used for Building Diagnostic Models for Skin Diseases
In the final analysis of this systematic review, the nine identified studies proposed and demonstrated MMDF approaches for the diagnosis of different skin diseases using their corresponding datasets. The studies utilized different methods and algorithms, including CNNs, random forest, multilayer perceptron (MLP), long short-term memory (LSTM), the color constancy algorithm, and hyperparameter optimization (HPO) algorithms. Accordingly, 88.9% of the studies (8 articles) primarily utilized CNN algorithms and architectures, while 11.1% utilized MLP and LSTM along with CNN architectures such as ResNet50, VGG16, and AlexNet. In general, the studies employed different DL-based methods for combining multiple modalities of patient data, such as attention-based mechanisms for combining image and metadata features, a multimodal transformer using the Vision Transformer (ViT) model, and the mapping of heterogeneous data features. In addition, DCNN architectures such as DenseNet121, ILSVRC 2015 models, VGG16, VGGNet19, ResNet50, ResNet101, Inception-V3, AlexNet with MLP, AlexNet with LSTM, ResNet50 with MLP, and ViT models were utilized for feature extraction and transfer learning.
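The common pattern across these studies is a two-branch network: a pre-trained CNN encodes the skin image, a small MLP encodes the tabular patient metadata, and the two feature vectors are fused before a shared classification head. The sketch below illustrates this pattern only; the ResNet-50 backbone, the number of metadata fields, the layer sizes, and the seven-class output are assumptions for illustration, not the configuration of any specific reviewed study.

```python
# Minimal sketch of the image + metadata fusion pattern described above.
# All dimensions and layer choices are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision import models

class ImageMetadataFusionNet(nn.Module):
    def __init__(self, num_metadata_features: int = 12, num_classes: int = 7):
        super().__init__()
        # CNN backbone used as an image feature extractor
        # (pass ResNet50_Weights.DEFAULT to load ImageNet weights for transfer learning).
        backbone = models.resnet50(weights=None)
        feat_dim = backbone.fc.in_features          # 2048 for ResNet-50
        backbone.fc = nn.Identity()                 # keep pooled features only
        self.image_encoder = backbone
        # Small MLP encoder for the tabular patient metadata.
        self.metadata_encoder = nn.Sequential(
            nn.Linear(num_metadata_features, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
        )
        # Late (feature-level) fusion: concatenate both feature vectors.
        self.classifier = nn.Sequential(
            nn.Linear(feat_dim + 64, 256), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(256, num_classes),
        )

    def forward(self, image: torch.Tensor, metadata: torch.Tensor) -> torch.Tensor:
        img_feat = self.image_encoder(image)             # (B, 2048)
        meta_feat = self.metadata_encoder(metadata)      # (B, 64)
        fused = torch.cat([img_feat, meta_feat], dim=1)  # (B, 2112)
        return self.classifier(fused)

# Example forward pass with dummy data.
model = ImageMetadataFusionNet()
logits = model(torch.randn(2, 3, 224, 224), torch.randn(2, 12))
print(logits.shape)  # torch.Size([2, 7])
```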
4.6. Fusion Strategies Suggested for Skin Disease Diagnosis
Generally, data fusion techniques must address several issues, including how the data are integrated, which data are fused, and the level at which the integration takes place. The studies used for this review demonstrated various fusion approaches, mainly feature fusion, model fusion, image fusion, and MMDF techniques. In this regard, 89% of the selected studies analyzed in this review implemented MMDF approaches for integrating mainly clinical images and textual medical data, whereas only one study (11%) demonstrated the MMDF approach for combining two imaging modalities (dermatoscopic and macroscopic images) with patient metadata [35].
As reported by the studies in this review, various fusion strategies have been experimented with on particular datasets while developing diagnostic models for specific skin diseases. These strategies include integrating multiple imaging modalities (two image modalities in this case) with textual patient data [35], using a multiplication-based fusion approach (also used to handle data imbalance) [36], using a metadata processing block (MetaBlock) to enhance the features extracted from the images throughout the classification [7], and using a naive combination of a patient data classifier module and a whole slide image classifier module [8]. Furthermore, a DNN with two encoders for extracting image and textual features and an MMDF module with intra-modality self-attention and inter-modality cross-attention was experimented with, and the model was reported to outperform other fusion models [37]. Another study proposed a neural network with a multimodal transformer consisting of two encoders, one for images and one for metadata, and one decoder that fuses the multimodal information, using the ViT model to extract image features, a soft label encoder for the metadata, and a mutual attention block to fuse the different features [38]. In another study, a fusion system was developed using four procedures: preprocessing of the image and metadata, feature extraction using six pre-trained models, feature concatenation (using a CNN with convolutional, pooling, and auxiliary layers), and finally classification of the skin disease [39]. Similarly, the feature concatenation method was used to develop a multimodal wound classification network by concatenating the outputs of an image classifier and a location-based classifier [9]. Finally, a skin cancer diagnostic model was developed following three procedures: extracting features (from skin images and patient clinical data using CNN architectures), applying an attention mechanism (to handle the multimodal features), and finally building a feature fusion model [40].
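To make the attention-based fusion strategies more concrete, the sketch below shows one way intra-modality self-attention and inter-modality cross-attention can combine image patch features with an embedded metadata token. This is an illustrative sketch in the spirit of the modules described in [37] and [38], not their exact architectures; the dimensions, single metadata token, and class count are assumptions.

```python
# Illustrative cross-attention fusion of image and metadata features.
# Dimensions and layer choices are assumptions, not the modules of [37]/[38].
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Fuses image patch features with a metadata token via attention."""
    def __init__(self, dim: int = 256, num_heads: int = 4, num_classes: int = 7):
        super().__init__()
        # Intra-modality self-attention over image patch tokens.
        self.image_self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Inter-modality cross-attention: the metadata token queries the image tokens.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, image_tokens: torch.Tensor, meta_token: torch.Tensor) -> torch.Tensor:
        # image_tokens: (B, N, dim) patch features from an image encoder
        # meta_token:   (B, 1, dim) embedded patient metadata
        x, _ = self.image_self_attn(image_tokens, image_tokens, image_tokens)
        x = self.norm1(x + image_tokens)                 # residual + norm
        fused, _ = self.cross_attn(query=meta_token, key=x, value=x)
        fused = self.norm2(fused + meta_token)           # (B, 1, dim)
        return self.classifier(fused.squeeze(1))         # (B, num_classes)

# Example with dummy encoder outputs.
fusion = CrossAttentionFusion()
logits = fusion(torch.randn(2, 49, 256), torch.randn(2, 1, 256))
print(logits.shape)  # torch.Size([2, 7])
```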
4.7. Achievements of MMDF Techniques in Diagnosing Skin Diseases
As reported by the reviewed studies, various DL methods and algorithms were used in developing diagnostic models for skin diseases based on MMDF techniques, including CNN, random forest, MLP, and LSTM. These algorithms achieved high performance in their respective studies when tested on particular datasets. Consequently, it was confirmed that MMDF techniques outperform traditional baseline diagnostic approaches [7][36]. Furthermore, the majority of the reviewed studies reported disease classification models with accuracies above 80% [8][9][35][38]. A study using a DNN with two encoders and a multimodal fusion module with intra-modality self-attention and inter-modality cross-attention reported an accuracy of 76.8% [37]. Similarly, another study in this review, which used feature extraction, feature concatenation, and classification for medical image analysis, reported 99.94% accuracy in the classification of seven selected skin diseases [39]. In general, the analysis results show that MMDF techniques significantly improve classification accuracy. Therefore, the utilization of multimodal data fusion techniques based on deep learning methods, algorithms, and models in different settings (such as ensembles of two or more of these methods, algorithms, and models) is a potential research area that needs further investigation, especially for the diagnosis of NTDs.