The Global Burden of Disease study measured years lived with disability [2] and found that hearing loss is the fourth leading cause of disability worldwide. The main impact of hearing loss in adults is impaired communication, which may harm relationships with family and friends and cause difficulties in the workplace. Untreated hearing loss in adults also has indirect health, psychosocial, and economic effects, leading to social isolation and decreased quality of life [3, 4, 5].
Prediction models based on machine learning can generate robust diagnostic and predictive parameters because they exploit relationships within the data to make accurate predictions, which allows their incremental value to be verified [27–29]. A sound machine learning model for the genetic analysis of noise-induced hearing loss (NIHL) is a basis not only for exploring the genetic etiology and treatment of NIHL, but also for noise protection and screening in special occupations or specific populations. Variants of the VWF gene appeared in 9 individuals, the highest frequency of any gene. Studies have found that VWF is related to NIHL in noise-exposed workers [30, 31], but no further research has been done. Another study showed that VWF may play a role in sudden sensorineural hearing loss in adults and that more than half of these patients have thrombotic tendencies [32]. These findings may be related to the pathogenesis of disorders of the auditory system. Future work could study this gene in terms of the mechanism of hearing loss.
The difference in hearing loss between subgroup A and subgroup B is not significant; the two groups may differ mainly in the pathogenesis of hearing loss. These individuals, who may have NIHL, can be distinguished only from genomic information. Whether the subtypes differ in etiology or have other clinical significance needs further study. The internal variability of each individual genome poses a challenge to variation analysis [33]. A variant had to be present in > 5% of the NIHL population (i.e. at least 4 of 78 NIHL patients) to be considered a fixed variant. Because of our strict screening conditions, the number of possible susceptibility genes in the population is relatively small. We selected 73 genes that carried variants in at least 4 NIHL patients and in none of the 35 controls, and used them as features for classification and prediction with unsupervised and supervised machine learning.
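The screening rule described above (a gene must carry variants in at least 4 cases and in no control) can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the function name `select_candidate_genes`, the dictionary input format, and the placeholder gene names are all assumptions made for the example.

```python
from collections import Counter


def select_candidate_genes(case_genes, control_genes, min_carriers=4):
    """Return genes carried by >= min_carriers cases and by no control.

    case_genes / control_genes map a sample ID to the genes in which
    that sample carries a variant (hypothetical input format).
    """
    # Count each gene once per case, even if it is listed more than once.
    carriers = Counter(g for genes in case_genes.values() for g in set(genes))
    # Any gene seen in at least one control is excluded.
    in_controls = set().union(*control_genes.values())
    return sorted(
        g for g, n in carriers.items()
        if n >= min_carriers and g not in in_controls
    )


# Toy data with placeholder gene names, not results from the study.
cases = {
    "case1": {"GENE_A", "GENE_B"},
    "case2": {"GENE_A", "GENE_B", "GENE_C"},
    "case3": {"GENE_A", "GENE_B"},
    "case4": {"GENE_A", "GENE_B", "GENE_C"},
    "case5": set(),
}
controls = {"ctrl1": {"GENE_B"}, "ctrl2": set()}

candidates = select_candidate_genes(cases, controls, min_carriers=4)
print(candidates)  # ['GENE_A']
```

In the toy data, GENE_B reaches the carrier threshold but is dropped because one control carries it, and GENE_C is dropped because only two cases carry it.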
We identified the top ten specific mutations predicted to be deleterious or damaging, in UTP20, DNAH1, ADAMTS5, SMPD4, COL5A1, SLC12A6, LIPH, MYO18A, CRB2, and SRMS. The UTP20 mutation is mainly a marker of subgroup B: it was found in 7 NIHL patients, all of them in subgroup B. Gene Ontology (GO) annotations related to this gene include RNA binding and binding. The possible susceptibility genes VWF, COL5A1, COL17A1, and MYO18A occur only in subgroup A and not in subgroup B. These genes therefore help to distinguish the two subgroups, although their mechanism in hearing loss remains to be studied. Previous studies found VWF and COL5A1 mutations to be related to hearing loss [34]. Consistent with this, subgroup A, which carries these genes, had worse high-frequency hearing thresholds than subgroup B, which carries few of them. The genes carried by subgroup B that clearly distinguish it from subgroup A may be less harmful, but the mechanism needs further study.
Random forest outperformed the decision tree in predicting subgroup membership from the possible susceptibility genes we found. This result is expected, because a random forest, which combines many individual trees, is known to be more robust than a single decision tree. By aggregating the outputs of multiple decision trees, random forest mitigates their sensitivity to the training data and thus reduces the variance of the predictions.
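The variance argument can be made concrete with the standard textbook expression for the variance of an average of equally correlated predictors, Var = rho * sigma^2 + (1 - rho) / B * sigma^2, where B is the number of trees. The sketch below is an idealized illustration under that assumption; the function name and the values of `sigma2` and `rho` are chosen for the example and are not estimates from this study.

```python
def ensemble_variance(sigma2, rho, n_trees):
    """Variance of the mean of n_trees predictors.

    Assumes every predictor has variance sigma2 and every pair has the
    same correlation rho (an idealization of trees in a forest):
        Var = rho * sigma2 + (1 - rho) / n_trees * sigma2
    """
    if n_trees < 1:
        raise ValueError("n_trees must be at least 1")
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    return rho * sigma2 + (1.0 - rho) / n_trees * sigma2


# Illustrative values only: unit variance, pairwise correlation 0.3.
for n_trees in (1, 10, 100):
    print(n_trees, round(ensemble_variance(1.0, 0.3, n_trees), 4))
```

With one tree the variance equals `sigma2`; as the number of trees grows it falls toward `rho * sigma2`, which is why decorrelating the trees (through bootstrap samples and random feature subsets) matters as much as adding more of them.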
Although we describe the functional roles of the genes in the genomic variation spectrum, we do not claim that they cause NIHL. Not all variants in the profile need to be directly related to pathology; they may instead be biomarkers, and whether they correspond to differences in changes of auditory function needs to be studied in a larger sample. Importantly, the subtypes we describe can be distinguished only by their genetic characteristics rather than by clinical parameters, suggesting that so far they can be detected only at the molecular depth of genomic analysis.
However, this study has some limitations. First, although the prediction algorithms used in this study can be applied to a relatively small sample, they generally perform better with a larger sample. Logistic regression models in particular may require about 10 events per predictor variable to keep overfitting small [35].
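The 10-events-per-variable rule of thumb translates into simple arithmetic. The sketch below is illustrative only: the helper names are hypothetical, and taking the smaller outcome class as the limiting event count is a common convention assumed here, not a detail stated in this study. The case and control counts (78 and 35) and the 73 gene features are taken from the text above.

```python
def required_events(n_predictors, events_per_variable=10):
    """Events needed to support n_predictors under the rule of thumb."""
    return n_predictors * events_per_variable


def max_predictors(n_events, n_non_events, events_per_variable=10):
    """Largest number of predictors the smaller outcome class supports.

    Assumption: the limiting count is the smaller of the two classes.
    """
    limiting = min(n_events, n_non_events)
    return limiting // events_per_variable


print(required_events(73))     # 730 events for 73 gene features
print(max_predictors(78, 35))  # 3 predictors for 78 cases vs 35 controls
```

Under these assumptions, a logistic regression using all 73 genes as predictors would call for far more events than the present sample provides, which is the sense in which sample size limits the models.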
We distinguished the genomic variation spectra of NIHL cases and controls and also revealed two subtypes of NIHL. Although the genetic and pathogenic nature of the variation is unclear, the distinct variation spectra can serve as building blocks of prediction models to detect subtype A in the general population, and they support the use of genomics and personalized medicine to predict noise sensitivity. We acknowledge that the good predictive value of our model depends on the 73 susceptibility gene variants having been correctly identified.