Human DNA AI Model to Predict COVID-19 Symptomatic or Asymptomatic Percentages


The current paper proposes using convolutional neural networks (CNNs) to analyze human genome single nucleotide variants (SNVs) from nuclear deoxyribonucleic acid (DNA) and mitochondrial deoxyribonucleic acid (mtDNA), presented as a 2D image structure, to determine whether the answer to COVID-19 severity can be found in the human genome. The methodology was applied to 447 samples from the Mexican population. From the results, two main groups were formed, symptomatic and asymptomatic cases, comprising 80.986% and 19.014% respectively. The model was validated through an online survey of individuals, giving an accuracy of 91.89%.


Background
In December 2019, there was an outbreak of pneumonia of unknown cause in Wuhan, Hubei province, China. The disease affected locals with complications such as fever, malaise, dry cough, shortness of breath, and respiratory failure 1 . By March 11, 2020, the disease had reached more than 100 territories worldwide, and it was recognized as a pandemic by the World Health Organization (WHO) 2 . The number of confirmed cases continued to grow worldwide. To prevent the spread of the virus, governments have imposed travel restrictions, quarantines, lockdowns, social isolation, cancellation of events, and closure of establishments. The pandemic is having a disruptive socioeconomic effect 3 .
Different investigations have examined possible causes of COVID-19 severity in the population, including blood type and previous or current smoking, among others 4,5,6 , but there are still no convincing results about the causes.
Genomics is a broad research field covering numerous diseases, and COVID-19 is no exception. The authors of 7 found significant differences in the structural genomics of patients who had severe reactions to the virus compared with the general population. However, the test they developed has no predictive value, so further research is still needed. An alternative approach is exemplified in the work of 8 , who, based on a genome-wide association study (GWAS) in which mortality was taken as the primary endpoint in patients with COVID-19, defined eight super variants that reflect the interaction of multiple loci associated with an increased risk of mortality. Another important finding is described by 9 , who evaluated 97 patients with COVID-19 at Barnes-Jewish Hospital by measuring their circulating mtDNA levels on the first day of their hospital stay. They found that mtDNA levels were much higher in patients who eventually died or were admitted to the intensive care unit. This association held independently of the patient's age, sex, and underlying health conditions. Genome genotyping opens the door to long-term research on certain diseases or specific conditions, and to prognostic tests based on people's DNA that give information about family history, ancestry, personal identity, and health. Companies such as SIKUENS Genetics, 23andMe, CRI Genetics, and Ancestry DNA, among others, offer this type of service, and they constantly extend their investigation to a growing number of single nucleotide variants (SNVs) related to certain diseases or conditions in order to offer more information panels in their reports 10 . Like these companies, biomedical and biotechnological research institutes worldwide need long periods of research to determine which parts of the genome are related to a certain disease.
This is the reason why we moved forward. In other research, data capacity in 2D structures has been improved using different techniques for different types of applications, such as DNA QR coding for security systems 12 and DNA species identification 13,14 . However, data capacity in 2D structures is still a challenge for DNA information encompassing 642,824 SNVs with two alleles per SNV, for a total of 1,285,648 allele values.

Methods
Based on previous work by 15 , we considered using a similar technique for an omics analysis of SNVs.
The proposed structure is an image 802 pixels wide and 802 pixels high, for a total of 643,204 pixels. Each pixel represents one SNV as a combination of its two alleles, as can be seen in Figure 1.

Training stage
The unsupervised CNN is provided with a set of DNA samples from individuals in the Mexican population who can be considered as having contracted COVID-19, together with information on whether each individual developed symptoms after contracting COVID-19.
It is important to mention that the DNA in the dataset is extracted from the oral epithelium, using a scraping methodology that collects cells from the walls of the mouth, not from saliva. Each sample is genotyped, and the genotyping can be replicated through the list of SNVs published in 18 . Once the clusters were formed, the CNN learned the patterns in those clusters and ended up correctly classifying the validation dataset into the corresponding clusters. The CNN architecture proposed for the current research was implemented in Python using Google's TensorFlow and Keras for the supervised learning and can be seen in Figure 2. The learning process of the CNN algorithm is described in 17 .
When analyzing the dataset of 447 images, we first found 76.95% of the images in one main cluster of related images and 23.04% of the images in another main cluster; the relations between clusters are based on a defined probability threshold of at least 95% (Ҩ ≥ 0.95).
In Mexico, the tests have only been used as a diagnostic method and not as a tool to predict the severity of the response to the disease. Considering also the percentage of false negatives and false positives, we concluded that relying only on the results of a positive COVID-19 test would not be reliable.
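A minimal sketch of how a 0.95 probability threshold can turn per-image cluster probabilities into cluster memberships and percentage shares is given below. It is not the authors' implementation: the function names, the dictionary layout, and the toy probabilities are all hypothetical, and the paper does not describe how its probabilities are computed.

```python
from collections import Counter

THRESHOLD = 0.95  # threshold stated in the text


def assign_clusters(probabilities, threshold=THRESHOLD):
    """Return the most probable cluster per image, or None if it is below threshold."""
    labels = []
    for probs in probabilities:
        best = max(probs, key=probs.get)
        labels.append(best if probs[best] >= threshold else None)
    return labels


def cluster_shares(labels):
    """Percentage of all images assigned to each cluster (unassigned images count in the total)."""
    counts = Counter(label for label in labels if label is not None)
    total = len(labels)
    return {label: 100.0 * n / total for label, n in counts.items()}


# Toy, made-up probabilities for four images and two clusters (illustrative only).
toy = [
    {"A": 0.99, "B": 0.01},
    {"A": 0.97, "B": 0.03},
    {"A": 0.02, "B": 0.98},
    {"A": 0.60, "B": 0.40},  # below the threshold for both clusters
]
labels = assign_clusters(toy)
shares = cluster_shares(labels)
```

Because images below the threshold are left unassigned, the shares need not sum to 100%.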

Validation stage
In a second stage, the CNN was provided with a set of DNA samples, also from the Mexican population, without any information on whether each individual had contracted COVID-19, and the CNN had to identify which individuals would develop symptoms. Two clusters of symptomatic and asymptomatic individuals were obtained, comprising 80.986% and 19.014% of the images respectively.
After the two main clusters were formed, a survey was designed and sent to the individuals, who are clients of the company, but only 37 answers were received. The survey consisted of one question with 5 options:
1. I have not suffered from COVID-19.
2. I have lived with infected people and I have not contracted COVID-19.
3. I have been infected with COVID-19 and I have not presented symptoms.
4. I have been infected with COVID-19 and I have had mild symptoms.
5. I have been infected with COVID-19 and have had severe symptoms.
Once the answers were received, we divided them into 3 groups. The groups are: 1. Answer 1 is Uncertain, because at the time of the survey the person may not have developed symptoms, either because they have not been infected yet or because they will not develop symptoms.

Results
Finally, the survey answers were categorized into the previously mentioned groups and compared with the Asymptomatic or Symptomatic cluster assigned to each individual. When the individual's answer fell in the Uncertain class, the result was considered a good prediction.
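As a quick arithmetic check, the reported accuracy of 91.89% is consistent with the 37 survey answers and the 3 wrong predictions mentioned in this paper. The figure of 34 correct predictions is derived here, not stated by the authors, and the function name is hypothetical.

```python
def accuracy_percent(n_answers, n_wrong):
    """Percentage of predictions counted as correct out of all survey answers."""
    if n_answers <= 0 or not 0 <= n_wrong <= n_answers:
        raise ValueError("need n_answers > 0 and 0 <= n_wrong <= n_answers")
    return 100.0 * (n_answers - n_wrong) / n_answers


# 37 answers received, 3 wrong predictions -> 34 counted as correct (derived).
reported = accuracy_percent(37, 3)
```

Rounded to two decimals, `reported` matches the 91.89% given in the text. With only 37 answers, each additional wrong or right prediction shifts the accuracy by about 2.7 percentage points.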

Conclusions and future work
It can be concluded from the results presented in this research that the clusters formed are an exceptionally good approximation to the statistics of COVID-19 severity in the Mexican population, but we cannot conclude that the clusters are related to COVID-19 without a validation process. The model was therefore validated through the survey of individuals, resulting in an accuracy of 91.89% based on those answers, bearing in mind that the answers are not a clinical test result such as a polymerase chain reaction (PCR) test. This is why we can interpret the 3 wrong predictions as ambiguous: the person may be pre-symptomatic and may not have developed symptoms yet. In that case, the accuracy of the CNN model would be higher, but for the time being we can conclude that the clusters formed are related to COVID-19 on the basis of the validation process. We can also conclude that the causes of COVID-19 severity lie in human genetics, which may or may not allow the virus to advance.
Finally, future work will incorporate datasets from different populations around the world and observe the clusters and percentages formed, with the help of access to human genome data as well as clinical PCR tests or surveys of the populations corresponding to those DNA data for model validation.