In this study, we developed a total of 40 DL models to shorten, as much as possible, the time required for the diagnosis of COVID-19 by RT-PCR and compared the diagnostic and screening performance of each model.
In a previous meta-analysis, Kim et al.12 determined that the pooled sensitivity of RT-PCR was 89% and that the PPVs and NPVs, which are affected by prevalence, were 47.3–98.3% and 93.4–99.9%, respectively. We used the diagnostic performance of the RT-PCR test reported by Kim et al.12 to benchmark the performance of each model obtained in this study.
Taking the pooled RT-PCR sensitivity of 89% as the reference value12, the 21st model was the first to exceed this standard, with a sensitivity of 93.33% (95% CI; 86.05–97.51%). In addition, considering the overall trend in diagnostic performance across all models, the 24th model, with a sensitivity of 90% (95% CI; 81.86–95.32%), also tended to exceed the sensitivity reference value. In view of these results, by using a Ct value of 36 rather than waiting for all 40 cycles of RT-PCR, it can be inferred that a meaningful shortening of the diagnosis time may be possible through the development of this DL model.
Furthermore, the 3rd through 9th models and the 11th, 16th, and 18th models also exceeded or approached the sensitivity reference value (Supplementary Table 1). However, the specificities of these models were generally below 80%, so it was difficult to judge, on the basis of diagnostic performance alone, whether these models were appropriate.
Regarding the PPV in this study, in the United States, where the prevalence was 10.06%, the 25th model showed the highest PPV at 73.33%. Similarly, in Italy (prevalence 6.98%) and South Korea (prevalence 0.27%), the PPV was also highest for the 25th model, at 65.02% and 6.43%, respectively. However, according to Kim et al.12, in the United States (prevalence 17.7% in March–April 2020), Germany (prevalence 5.7%), and Taiwan (prevalence 1%), the PPVs of RT-PCR itself were 95%, 84.3%, and 47.3%, respectively. Although the prevalences in the two studies did not match and were measured at different times, considering the range of prevalence levels, it can be inferred that the positive screening performance of the models developed in this study is somewhat inferior to that of RT-PCR.
On the other hand, regarding negative screening performance, which is also affected by prevalence, the 20th model already showed an NPV of 96.34% (95% CI; 95.89–98.33%) in the United States (prevalence 10.06%), and the same model showed NPVs of 98.21% (95% CI; 97.18–98.86%) in Italy (prevalence 6.98%) and 99.21% (95% CI; 99.90–99.96%) in South Korea (prevalence 0.27%).
According to the research results of Kim et al.12, the PPV and NPV of RT-PCR showed distributions of 47.3–98.3% and 93.4–99.9%, respectively, according to national prevalence (prevalence range 1–39%, March–April 2020). Considering the screening performance of RT-PCR itself, the negative screening performance of the models developed in this study can be considered similar to that of RT-PCR.
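The prevalence dependence of the PPV and NPV discussed above follows directly from Bayes' theorem. The sketch below illustrates this relationship; the 90% sensitivity and specificity values are hypothetical round numbers, not the measured performance of any model in this study.

```python
def ppv_npv(sensitivity, specificity, prevalence):
    """Positive and negative predictive values via Bayes' theorem."""
    tp = sensitivity * prevalence              # true-positive fraction
    fp = (1 - specificity) * (1 - prevalence)  # false-positive fraction
    tn = specificity * (1 - prevalence)        # true-negative fraction
    fn = (1 - sensitivity) * prevalence        # false-negative fraction
    return tp / (tp + fp), tn / (tn + fn)

# Illustrative test with 90% sensitivity and 90% specificity at the
# prevalence levels mentioned in the text (US, Italy, South Korea).
for prev in (0.1006, 0.0698, 0.0027):
    ppv, npv = ppv_npv(0.90, 0.90, prev)
    print(f"prevalence {prev:.2%}: PPV {ppv:.2%}, NPV {npv:.2%}")
```

Even with fixed sensitivity and specificity, the PPV collapses as prevalence falls while the NPV rises toward 100%, which is why the two metrics behave so differently across countries.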
In this study, we constructed a radar chart for each model using the PPV, NPV, and accuracy, all of which are affected by prevalence and represent screening performance. The screening performance of each model was then expressed as the percentage of the radar chart's total area covered by that model, and these area ratios were themselves plotted on a radar chart. Through this chart, we confirmed that, considering the PPV, NPV, and accuracy together, the 25th model had the largest area ratio. Based on these results, we propose that it is reasonable to present the 25th model as the model with minimal bias among positive screening performance, negative screening performance, and accuracy.
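The area-ratio calculation can be sketched as follows. With three axes spaced 120° apart, the triangle spanned by values a, b, c has area ½·sin 120°·(ab + bc + ca), so the ratio to the full chart (all three metrics equal to 1) reduces to (ab + bc + ca)/3. The metric values in the example are illustrative, not the measured values of any model in this study.

```python
import math

def radar_area_ratio(ppv, npv, accuracy):
    """Percentage of a three-axis radar chart (axes 120 degrees apart)
    covered by the triangle spanned by the three metrics, each in [0, 1].
    With equal 1:1:1 weights this reduces to (ab + bc + ca) / 3."""
    a, b, c = ppv, npv, accuracy
    triangle = 0.5 * math.sin(math.radians(120)) * (a * b + b * c + c * a)
    full = 0.5 * math.sin(math.radians(120)) * 3.0  # all metrics equal 1
    return 100.0 * triangle / full

# Illustrative metric values only (fractions of 1).
print(radar_area_ratio(0.73, 0.96, 0.90))
```

Note that because products of the three metrics enter symmetrically, this formulation implicitly weights the PPV, NPV, and accuracy 1:1:1; changing the weights would change the computed ratio.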
To the best of our knowledge, no study has reduced the time required for RT-PCR-based diagnosis by developing a model trained on raw RT-PCR data and confirming its diagnostic performance. Although one study used RT-PCR curves to build an AI model, a convolutional neural network (CNN), to reduce false-positive diagnoses, that study was not concerned with shortening the diagnosis time and used graph images, differentiating it from ours13. In addition, recently published AI- and DL-related COVID-19 diagnostic studies have described models trained on CT or CXR images using various CNN methods; other studies on the diagnosis of COVID-19 have described models trained on blood test results or clinical information. First, in studies reporting the performance of CNN-based models trained on chest CT images, the sensitivity ranged from 77–90%, the specificity from 68–96.6%, and the AUROC from 0.85 to 0.97 (refs. 1–3, 14–19). Second, in studies reporting the performance of CNN-based models trained on CXR images, the sensitivity ranged from 78–97%, the specificity from 72.6–99.17%, and the AUROC from 0.77 to 0.92 (refs. 4–7, 20–22). Third, in studies evaluating the diagnostic performance of models trained with blood tests or clinical information, the sensitivity ranged from 66–93%, the specificity from 64–97.9%, and the AUROC from 0.86 to 0.979 (refs. 23–25).
What is needed in the clinical field is to increase the efficiency of hospital bed resource management through rapid isolation, rapid diagnosis, and rapid, safe release from isolation. From that perspective, the above studies suggest that COVID-19 diagnosis may be possible through the application of AI. However, given the imbalance and bias of the data selected for training, we question whether this approach can be safely used in clinical settings for the diagnosis of COVID-19. On these issues, Laghi agrees that efforts to diagnose COVID-19 through AI models are necessary; however, he considers it very risky to trust the diagnostic performance of the AI models presented in these studies and to use them in clinical settings, because imaging tests such as CXR or chest CT can show normal findings in the early stage of COVID-19 infection26.
Unlike the models in previous studies, the model developed in this study was not trained on imaging tests such as CXR or chest CT, blood test results, or clinical information. Instead, we developed a model trained with LSTM, a DL method for time series data, using raw data from cycles 1 to 40 of RT-PCR. The DL model developed in this study therefore has the potential to enable earlier diagnosis via RT-PCR. The sensitivity of the 21st model already exceeded the sensitivity reference value, and both the sensitivity and specificity of the 24th model exceeded 90%. Considering the time required to complete 40 cycles of RT-PCR, the diagnostic performance of the models developed in this study shows the possibility of reducing the RT-PCR diagnosis time by almost half. In addition, although the PPV of the developed model indicated somewhat lower positive screening performance than RT-PCR, its NPV indicated negative screening performance similar to that of RT-PCR. If additional information, such as the patient's clinical characteristics, blood test results, and imaging findings from CXR or chest CT, were combined with this DL model, its performance for early diagnosis could be expected to become more refined. We infer that such efforts have the potential to contribute to improving the efficiency of in-hospital bed resource management for patients with fever or screening symptoms.
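As a minimal sketch of this approach, the following pure-Python LSTM cell consumes one fluorescence reading per PCR cycle and produces a fixed-size hidden state that a classifier head could map to a positive/negative call. The weights are random and untrained, and the hidden size, input scaling, and helper names are illustrative assumptions; the study's actual architecture and framework are not reproduced here.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class LSTMCell:
    """Single-feature LSTM cell with random, untrained weights.
    Illustrative only; real training would use a DL framework."""
    def __init__(self, hidden_size, seed=0):
        rng = random.Random(seed)
        self.h = hidden_size
        # Per gate: one input weight, recurrent weights, and a bias per unit.
        self.W = {g: [rng.uniform(-0.1, 0.1) for _ in range(hidden_size)]
                  for g in "ifog"}
        self.U = {g: [[rng.uniform(-0.1, 0.1) for _ in range(hidden_size)]
                      for _ in range(hidden_size)] for g in "ifog"}
        self.b = {g: [0.0] * hidden_size for g in "ifog"}

    def step(self, x, h, c):
        def gate(name, act):
            return [act(self.W[name][j] * x
                        + sum(self.U[name][j][k] * h[k] for k in range(self.h))
                        + self.b[name][j]) for j in range(self.h)]
        i = gate("i", sigmoid)    # input gate
        f = gate("f", sigmoid)    # forget gate
        o = gate("o", sigmoid)    # output gate
        g = gate("g", math.tanh)  # candidate cell state
        c_new = [f[j] * c[j] + i[j] * g[j] for j in range(self.h)]
        h_new = [o[j] * math.tanh(c_new[j]) for j in range(self.h)]
        return h_new, c_new

def run_sequence(cell, fluorescence):
    """Feed one fluorescence value per PCR cycle; return the final hidden
    state, which a classifier head would map to positive/negative."""
    h = [0.0] * cell.h
    c = [0.0] * cell.h
    for x in fluorescence:
        h, c = cell.step(x, h, c)
    return h

# Illustrative truncated run: 24 cycles of synthetic fluorescence readings.
cell = LSTMCell(hidden_size=8)
features = run_sequence(cell, [0.01 * t for t in range(24)])
print(len(features))  # fixed-size representation regardless of cycle count
```

Because the hidden state has a fixed size regardless of how many cycles have been observed, the same classifier head can be applied after any truncation point, which is what makes per-cycle early prediction possible.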
There are several limitations in this study. First, the 181 positive cases and 5,629 negative cases used for training contained too few positive cases relative to negative cases. This data imbalance can affect the diagnostic performance of the developed DL models and ultimately makes it difficult to use them universally. However, through this study, we were able to confirm that diagnostic performance was not significantly impaired without performing all 40 cycles of PCR. Second, no DL methods for time series data other than LSTM were applied. Consequently, it is not known whether LSTM is the best method, because no comparative analysis with models developed through other DL methods was performed. Nevertheless, LSTM, a recurrent neural network (RNN)-based method, was selected first in this study because it was created to solve the vanishing gradient problem of conventional RNNs27. Of course, it will be necessary to collect additional data in a follow-up study and perform comparative analyses with other DL methods for time series data. Third, presenting the screening performance of a model as the area ratio of a radar chart is not a standard method. The triangle area is calculated with the PPV, NPV, and accuracy weighted 1:1:1; if these weights were set differently according to need (for example, if accuracy were considered more important), the calculated area and ratio would differ. Nevertheless, since higher PPV, NPV, and accuracy values all naturally imply greater screening power, we believe that, although the radar chart area ratio does not perfectly reflect the screening power of a DL model, it helps to convey the approximate trend.