Because the COVID-19 pandemic continues to threaten many lives, there is an immediate need for better tools to identify novel viruses, in support of pathogenesis studies, treatment, and vaccine development for the current pandemic and for potential future pandemics. We used DL to classify TEM images of SARS-CoV-2 and 15 other types of viruses. We also showed that PCA and t-SNE can provide information about the similarity of a novel virus to other virus families. Clustering of the TEM images with PCA and t-SNE showed that SARS-CoV-2 is closest to Influenza among the 15 virus families considered in our study. Our approach can help provide more accurate identification of a virus from TEM images, given the high level of expertise required to analyze such images and the high chance of false positives or false negatives in manual analysis. To the best of our knowledge, this is the first study that uses pretrained DL models to classify viruses from TEM images.
The DL models used in this paper are relatively large in terms of the number of parameters. Training such models from scratch requires large datasets and considerable time. By using pretrained models, we developed DL frameworks that identify TEM images more efficiently in terms of the time and data required for training. These models were pretrained on a large dataset, namely the ImageNet dataset 20. All three DL models considered in this paper provided predictions with accuracy larger than 70.6% (at 95% CI, Table 2), and the ROC curves showed areas larger than 0.9 (Fig. 3). These DL models are therefore suitable candidates for further improving the identification of viruses from TEM images.
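To make the "area under the ROC curve" figure concrete, the following is a minimal sketch of how such an area can be computed for one virus family against all others. The one-vs-rest setup, the function name `roc_auc`, and the example scores are illustrative assumptions, not the procedure or data behind Fig. 3.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve for a binary (one-vs-rest) problem.

    Uses the rank interpretation of AUC: the probability that a randomly
    chosen positive image receives a higher score than a randomly chosen
    negative image, with ties counted as one half.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical model scores for "is this image family X?" (1 = yes, 0 = no).
example_scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
example_labels = [1, 1, 1, 0, 0, 0]

# 8 of the 9 positive/negative pairs are ranked correctly.
example_auc = roc_auc(example_scores, example_labels)
print(f"AUC = {example_auc:.3f}")
```

An area of 1.0 means every image of the family is scored above every image of the other families, while 0.5 corresponds to random ranking.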
The PCA and t-SNE results identified the family closest to SARS-CoV-2. According to the PCA and t-SNE visualizations, the novel virus is close to the Influenza family of viruses (Figs. 4 and 5). These results should be interpreted with caution, however, as more computational and experimental investigations are needed to assess the similarities between SARS-CoV-2 and Influenza. As observed from the PCA and t-SNE results (Figs. 4 and 5), Marburg and Ebola are also similar to each other. Because Marburg and Ebola both belong to the Filoviridae family 26,27, our PCA and t-SNE results are in line with the literature. Our results can provide insights into future novel viruses to enable more rapid treatment and vaccine development.
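As a rough illustration of how a "closest family" can be read off a low-dimensional projection, the sketch below projects feature vectors with PCA and ranks families by the distance between cluster centroids. The centroid-distance criterion, the function names, and the synthetic features (with hypothetical families `family_A`, `family_B`, and `novel`) are assumptions made for this example; they are not the data or the exact procedure behind Figs. 4 and 5.

```python
import numpy as np


def pca_project(features, n_components=2):
    """Project the rows of `features` onto the top principal components."""
    centered = features - features.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


def closest_family(embedded, labels, query):
    """Return the family whose centroid lies nearest to the `query` centroid."""
    labels = np.asarray(labels)
    query_centroid = embedded[labels == query].mean(axis=0)
    distances = {
        family: float(
            np.linalg.norm(embedded[labels == family].mean(axis=0) - query_centroid)
        )
        for family in sorted(set(labels.tolist()))
        if family != query
    }
    return min(distances, key=distances.get), distances


# Synthetic stand-in for per-image feature vectors (20 images per family,
# 8 features each); "novel" is deliberately placed near "family_A".
rng = np.random.default_rng(0)
centers = {"family_A": 0.0, "family_B": 10.0, "novel": 1.0}
features = np.vstack([rng.normal(c, 0.1, size=(20, 8)) for c in centers.values()])
labels = [name for name in centers for _ in range(20)]

embedded = pca_project(features)
nearest, distances = closest_family(embedded, labels, query="novel")
print(nearest, distances)
```

A t-SNE embedding could be substituted for `pca_project` (for instance with scikit-learn's `TSNE`), although distances between t-SNE clusters are generally less directly interpretable than distances in a PCA projection.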
One future advancement may be applying our methodology to TEM images without negative staining 28. If ML algorithms can classify TEM images without staining, this could further reduce the time needed for virus studies. The virus images used in our study were negative-staining TEM images because the dataset available to us consisted of this kind of image 6. Applying our methodology to unstained TEM images could demonstrate the capability of ML to classify them, once such a dataset becomes available.
One limitation of this study was the small number of SARS-CoV-2 images (n = 25 before image augmentation). We used image augmentation to generate more SARS-CoV-2 images from the available TEM images. The prediction results for the SARS-CoV-2 family were relatively better than those for other families (AlexNet and SqueezeNet, Fig. 3). This result may be due to the limited number of SARS-CoV-2 images and the use of augmentation to produce more of them. With more images, the training set would have more variability, which can lead to more accurate predictions for future SARS-CoV-2 images. This limitation can be addressed as more image data for this novel virus become available.
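The augmentation step can be sketched as follows. The specific transformations (90-degree rotations and horizontal flips) and the function name `augment` are assumptions chosen for illustration; the transformations actually applied to the SARS-CoV-2 images may differ.

```python
import numpy as np


def augment(image):
    """Return the 8 rotation/flip variants of a 2-D grayscale image.

    The first variant is the unmodified image; the others are its
    90-degree rotations and their left-right mirror images.
    """
    variants = []
    for quarter_turns in range(4):
        rotated = np.rot90(image, quarter_turns)
        variants.append(rotated)
        variants.append(np.fliplr(rotated))
    return variants


# Small asymmetric array standing in for a TEM image.
demo_image = np.arange(12).reshape(3, 4)
demo_variants = augment(demo_image)
print(len(demo_variants), [v.shape for v in demo_variants])
```

Every variant is derived from the same original image, so the augmented set carries less variability than the same number of independently acquired images would.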
The dataset used in this study also has limitations. With a larger dataset, the DL classification predictions for SARS-CoV-2 could generalize better. Moreover, there could be virus families that were not included in the dataset, and those families could be closer to SARS-CoV-2 than the 15 virus families considered in this study. As such, including more virus families would improve our SARS-CoV-2 clustering outcomes. Our results may be improved by adding more images of the 15 viruses as well as by adding more virus families.
The DL models predicted the family of each TEM image. In this study, we used three pretrained models, namely AlexNet, VGG and SqueezeNet. Following our approach, more pretrained models can be used to predict virus families from TEM images, and the final result can be based on the predictions of several DL models. With this “ensemble approach”, the combined outcome is expected to classify a TEM image with higher accuracy than a single model alone.
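One simple way to combine several models is sketched below, assuming each model outputs a probability per virus family. The two combination rules shown (averaging probabilities and majority voting), the function names, and the example probabilities are illustrative assumptions; the ensemble approach described above does not prescribe a particular rule.

```python
from collections import Counter


def soft_vote(model_probs, class_names):
    """Average per-class probabilities across models and pick the top class."""
    n_models = len(model_probs)
    averaged = [
        sum(probs[c] for probs in model_probs) / n_models
        for c in range(len(class_names))
    ]
    best = max(range(len(class_names)), key=lambda c: averaged[c])
    return class_names[best], averaged


def majority_vote(predicted_labels):
    """Return the label predicted by the largest number of models."""
    return Counter(predicted_labels).most_common(1)[0][0]


# Hypothetical outputs of three models for a single TEM image.
families = ["SARS-CoV-2", "Influenza", "Ebola"]
probs_per_model = [
    [0.6, 0.3, 0.1],
    [0.4, 0.5, 0.1],
    [0.5, 0.2, 0.3],
]

soft_label, averaged_probs = soft_vote(probs_per_model, families)
hard_label = majority_vote(
    [families[p.index(max(p))] for p in probs_per_model]
)
print(soft_label, hard_label)
```

Averaging probabilities lets a confident model outweigh a hesitant one, whereas majority voting treats every model's top prediction equally; which rule works better for TEM images would need to be checked on held-out data.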
In this study, we used three DL models. As indicated above, the results can be improved by considering more DL models. Other ML models, such as decision tree algorithms and support vector machine algorithms, can also be added. The results obtained from single models or ensembles of models can be compared to develop better models for classifying viruses from TEM images. Our approach can lead to faster, more convenient and more reliable automatic methods for classifying TEM images. These automatic methods can help avert pandemics through early identification, or speed up recovery by targeting the precise structure of the virus.