Understanding the emergence and progression of complex diseases remains a persistent challenge for researchers because of the intricate, multifactorial nature of these diseases. They arise from the interplay between genetic and environmental factors, which produces a plethora of combinations that must be considered in modeling. From a genetic standpoint, understanding the etiology of complex diseases requires extensive localization of significant genomic variations because of their polygenic nature [1, 2, 3]. Identifying these biomarkers, although it elucidates only a portion of the underpinnings of complex diseases, could nevertheless improve patients’ chances of survival by enabling more personalized and advanced disease risk assessment.
A genome-wide association study (GWAS) is the traditional approach used to discover genetic biomarkers, i.e. single nucleotide polymorphisms (SNPs), associated with various traits and diseases. GWAS has successfully identified risk loci for a wide array of illnesses, including cancer, Type 2 diabetes mellitus, Crohn’s disease, and coronary artery disease, among others. Despite these achievements, GWAS is limited by its individual-SNP analysis approach, a limitation exacerbated by the high dimensionality of genomic datasets. Because a very large number of individual association tests are performed, stringent significance thresholds must be adopted to control error rates, which leaves detection underpowered. This increases the probability of missing SNPs with small effects that are truly associated with a trait and could contribute significantly to phenotypic variability. The traditional GWAS approach also fails to capture SNP-SNP interactions, as it tests only the marginal effects of SNPs and disregards the variants’ joint contributions to phenotypic expression. These interactions require explicit analysis since they are vital in addressing the “missing heritability” problem, which states that single genetic variations are insufficient to explain the entire heritability of a trait.
Under the “polygenic paradigm”, refining the statistical approach, for example by increasing sample sizes or reducing the number of tests performed, is crucial to increasing the chances of discovering true associations. Empirical evidence [15, 16] has shown that as sample size increases, GWAS continues to yield more novel trait-associated loci. However, increasing the sample size is not always feasible, especially for studies involving small populations and diseases with low prevalence. In such cases, it is more viable to reduce the number of tests performed, which relaxes the stringent conditions a genomic variant must meet to be considered significant. Existing approaches to this latter strategy include haplotype-based association analysis and SNP-set analysis, both of which also address the inability of GWAS to capture SNP-SNP interactions [17, 18]. Haplotype-based analysis accounts for linkage disequilibrium between SNPs, while SNP-set analysis, e.g. gene-based and pathway-based analyses, considers the joint effects of variants on phenotypic expression. Aside from addressing these limitations of GWAS, SNP-set analysis also permits hypothesis testing on associations that may exist between wider loci and traits. However, when this type of analysis groups SNPs based on prior biological knowledge, a study’s success may be hampered if information on the genetic variations and competitive pathways related to the trait is insufficient. To allow a less restricted analysis, it is necessary to explore other methods of forming SNP-sets using information that is independent of a priori biological knowledge.
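The effect of reducing the number of tests can be illustrated with a small calculation. The sketch below assumes a Bonferroni correction, which is one common way of controlling the family-wise error rate and is not necessarily the correction used in the studies cited above. The function name and the counts of SNPs and SNP-sets are hypothetical and chosen only for illustration.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold that keeps the family-wise error rate at alpha."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


# Hypothetical study sizes, chosen only for illustration.
n_snps = 1_000_000      # one association test per SNP
n_snp_sets = 20_000     # one association test per SNP-set

snp_level = bonferroni_threshold(0.05, n_snps)
set_level = bonferroni_threshold(0.05, n_snp_sets)

print(f"per-SNP threshold:     {snp_level:.1e}")
print(f"per-SNP-set threshold: {set_level:.1e}")
print(f"relaxation factor:     {set_level / snp_level:.0f}x")
```

With one test per SNP the threshold is on the order of the conventional genome-wide significance level, whereas testing a smaller number of SNP-sets yields a proportionally less stringent threshold.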
Machine learning (ML) is a powerful approach to solving complex problems across fields and disciplines because of its capability to handle and analyze high-dimensional datasets [22, 23, 24]. Several studies have already demonstrated the usability of ML on genomic datasets [25, 26, 27]; however, to our knowledge, only a handful of studies discuss its application to SNP-set formation [28, 29, 30, 31]. These studies employed cluster analysis to form SNP-sets in a data-driven manner. This approach could subsequently lead to the identification of novel risk loci associated with a trait, although computational complexity and cost may pose problems. Because genomic datasets are usually high-dimensional, they are susceptible to the “curse of dimensionality” [32, 33], a problem that could be addressed by clustering only the SNPs found in genomic regions known to play a role in trait development [29, 30]. However, this defeats the purpose of performing an inclusive analysis, as the search for significant biomarkers is restricted to relatively narrow regions. To analyze a more varied selection of SNPs, dimensionality reduction techniques based on random forest (RF) could be used to reduce the dimensions of the dataset before cluster analysis is conducted. RF has been widely incorporated in SNP research [25, 34, 35, 36] because of two important properties: (1) a nonparametric nature that allows predictive models to be built without preliminary statistical assumptions, and (2) the capability to provide an importance score, i.e. a variable importance measure (VIM), for each SNP, which increases the probability of detecting highly relevant biomarkers.
Cluster analysis and random forest have each been shown to be applicable and effective in genomic data analysis, specifically in identifying predictive and presumably disease-associated SNPs [31, 37]. However, based on our literature review, the integration of these approaches has not been explored on SNP data. This study aims to combine the two techniques to augment previous GWAS findings and allow the discovery of novel trait-associated susceptibility loci. The study implements the proposed integrated framework using the following three-step algorithm:
1. Dimensionality reduction through RF;
2. SNP-set formation through cluster analysis involving top-ranking SNPs from Step 1 and SNPs considered by GWAS to be significantly associated with the trait of interest (termed in this study as ‘GWAS-identified SNPs’); and
3. Association testing on the resulting SNP-sets from Step 2.
In Step 1, dimension reduction is implemented using random forest feature selection to circumvent the “curse of dimensionality” problem associated with analyzing high-dimensional SNP datasets. In Step 2, the top-ranking SNPs determined from the results of Step 1 and the GWAS-identified SNPs are subjected to cluster analysis to evaluate shared similarities among the variants and form SNP-sets. Finally, Step 3 involves testing the SNP-sets derived from Step 2 for trait association. The proposed methodology was applied to the GWAS data by  wherein the phenotype of interest is hepatitis B virus surface antigen (HBsAg) seroclearance, a marker for clearance of chronic hepatitis B virus (HBV) infection.
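The three steps can be sketched in code as follows. This is a minimal illustration under several assumptions that are not taken from the study: genotypes are coded as minor-allele counts (0, 1, 2), the phenotype is binary, agglomerative clustering stands in for the unspecified clustering method, and a Mann-Whitney U test on a simple burden score stands in for the unspecified SNP-set association test. The function name `rf_cluster_pipeline` and the parameters `n_top` and `n_sets` are hypothetical.

```python
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.cluster import AgglomerativeClustering
from sklearn.ensemble import RandomForestClassifier


def rf_cluster_pipeline(genotypes, phenotype, gwas_snps, n_top=20, n_sets=5, seed=0):
    """Return (snp_indices, p_value, is_significant) for each SNP-set.

    genotypes: (n_samples, n_snps) array of minor-allele counts (assumed coding).
    phenotype: (n_samples,) array of 0/1 labels (assumed binary trait).
    gwas_snps: column indices of the GWAS-identified SNPs.
    """
    # Step 1: dimensionality reduction via random forest variable importance.
    forest = RandomForestClassifier(n_estimators=200, random_state=seed)
    forest.fit(genotypes, phenotype)
    top = np.argsort(forest.feature_importances_)[::-1][:n_top]

    # Step 2: cluster the top-ranking SNPs together with the GWAS-identified SNPs.
    # Each SNP is one observation, described by its genotypes across all samples.
    candidates = np.union1d(top, gwas_snps)
    profiles = genotypes[:, candidates].T
    labels = AgglomerativeClustering(n_clusters=n_sets).fit_predict(profiles)
    snp_sets = [candidates[labels == k] for k in range(n_sets)]

    # Step 3: test each SNP-set for association with the trait.
    # The burden score (sum of allele counts in the set) is an illustrative choice.
    alpha = 0.05 / n_sets  # Bonferroni over SNP-sets, not over individual SNPs
    results = []
    for members in snp_sets:
        burden = genotypes[:, members].sum(axis=1)
        _, p_value = mannwhitneyu(
            burden[phenotype == 1], burden[phenotype == 0], alternative="two-sided"
        )
        results.append((members, p_value, p_value < alpha))
    return results


# Synthetic data for demonstration only; SNPs 3 and 7 are simulated as causal.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 100))
y = (X[:, 3] + X[:, 7] + rng.normal(0, 1, 200) > 2).astype(int)

for members, p_value, significant in rf_cluster_pipeline(X, y, gwas_snps=np.array([3, 7])):
    print(members, f"p={p_value:.3g}", "significant" if significant else "")
```

The sketch selects SNPs and tests the resulting sets on the same samples, which can inflate significance; a real analysis would need to account for this.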