The advent of sequencing technologies and the availability of sequence information in public databases provide valuable resources for a fundamental understanding of the genes and pathways that regulate biological processes. They also provide unique opportunities for comparative analysis of multiple datasets, which can resolve research problems that pair-wise dataset comparisons would address less successfully. For example, a typical transcriptome analysis provides expression information for more than 10,000 genes across different samples and time points, especially in eukaryotes. Expression data are usually analysed with statistical models to identify differentially expressed genes (Thomas et al, 2001). Compared with individual experiments, analysis of expression data across multiple datasets and experiments is highly challenging, but it can reveal novel genes and pathways that are common to, and characteristic of, multiple datasets. It also reveals global patterns in gene expression and is useful for identifying important features responsible for causal phenotypes. Although meta-transcriptome analyses have been reported (Leimena et al, 2013; Cohen and Leach, 2019), machine learning algorithms are increasingly used for gene expression analysis because of their high predictive power and their ability to characterize patterns in large volumes of data (Pizroonia et al, 2008).
Machine learning uses computational methods to train a model for effective data analysis, typically by minimizing a loss function (Goodfellow et al, 2016). A model is first trained on data to select appropriate parameters, and the trained model is then applied to test samples; this process can yield high prediction accuracy and improved pattern recognition. Machine learning is now routinely used to solve genetics-based research problems (Libbrecht and Noble, 2015). Both unsupervised and supervised machine learning models are employed in the analysis of genetic datasets, which include sequence information of genes and promoters, gene expression data, and epigenetic datasets (Shipp et al, 2002).
Machine learning tasks are generally categorized into supervised and unsupervised learning methods (Lopez et al, 2018). Unsupervised methods are useful for finding patterns in genetic (expression) information without labelled data. In contrast, supervised machine learning uses labelled data to train models through computational methods and algorithms (Abdulquader et al, 2020). The trained models are then used to solve research problems broadly categorized as classification or regression (Liaw and Weiner, 2002). In addition, machine learning algorithms are capable of ranking features by importance, predicting complex traits, analysing population genetic drift, and locating causal genes for complex diseases and traits (Libbrecht and Noble, 2015).
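As an illustration of the supervised workflow described above (labelled data, training by minimizing a loss function, evaluation on held-out samples, and ranking of features), the following is a minimal sketch rather than the method used in this work. It assumes a simple logistic regression classifier trained by gradient descent. The expression values are synthetic, and the gene indices, effect size, learning rate and all function names are hypothetical choices made only for this example.

```python
import math
import random


def make_synthetic_expression(n_samples=200, n_genes=5, informative=(0, 1), seed=0):
    """Toy log-expression matrix; genes in `informative` are shifted under stress."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_samples):
        label = i % 2  # 1 = stress, 0 = control (labels alternate, so classes are balanced)
        row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
        for g in informative:
            row[g] += 3.0 * label  # assumed effect size, for illustration only
        X.append(row)
        y.append(label)
    return X, y


def center(X, means):
    """Subtract per-gene means (estimated on the training set only)."""
    return [[x - m for x, m in zip(row, means)] for row in X]


def sigmoid(z):
    # Numerically stable logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(w, b, row):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, row)) + b)


def log_loss(w, b, X, y):
    """Mean binary cross-entropy: the loss function being minimized."""
    eps = 1e-12
    total = 0.0
    for row, label in zip(X, y):
        p = min(max(predict_proba(w, b, row), eps), 1.0 - eps)
        total -= label * math.log(p) + (1 - label) * math.log(1.0 - p)
    return total / len(X)


def train(X, y, lr=0.1, epochs=200):
    """Batch gradient descent on the mean cross-entropy loss."""
    n_genes = len(X[0])
    w, b = [0.0] * n_genes, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_genes, 0.0
        for row, label in zip(X, y):
            err = predict_proba(w, b, row) - label
            for j in range(n_genes):
                grad_w[j] += err * row[j]
            grad_b += err
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def accuracy(w, b, X, y):
    hits = sum((predict_proba(w, b, row) >= 0.5) == bool(label) for row, label in zip(X, y))
    return hits / len(X)


X, y = make_synthetic_expression()
X_train, y_train, X_test, y_test = X[:150], y[:150], X[150:], y[150:]

# Centre both sets with statistics from the training data only
means = [sum(col) / len(X_train) for col in zip(*X_train)]
X_train, X_test = center(X_train, means), center(X_test, means)

initial_loss = log_loss([0.0] * len(means), 0.0, X_train, y_train)
w, b = train(X_train, y_train)
final_loss = log_loss(w, b, X_train, y_train)
test_acc = accuracy(w, b, X_test, y_test)

# Rank genes by absolute weight as a crude feature-importance measure
ranking = sorted(range(len(w)), key=lambda j: abs(w[j]), reverse=True)

print("training loss:", round(initial_loss, 3), "->", round(final_loss, 3))
print("held-out accuracy:", test_acc)
print("genes ranked by |weight|:", ranking)
```

Ranking genes by the absolute value of their weights is only meaningful here because all synthetic genes share the same scale; with real expression data, features would normally be standardized first, and tree-based or permutation-based importance measures are common alternatives.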
Rice is one of the major food crops and is genetically and agronomically adapted to high water requirements for producing a unit quantity of economic product (~2500–5000 litres of water per kg of paddy). Thus, the evaluation and improvement of rice varieties under limited or reduced irrigation strategies for high genetic gain has recently become an important research area because of the anticipated effects of climate change (Tuong and Bouman, 2003). It is generally presumed that multiple abiotic stress tolerance in rice varieties is associated with greater adaptation to growth in unfavourable ecologies, such as limited water availability or rainfed cultivation. Several major genes, quantitative trait loci, and genes in linkage disequilibrium associated with abiotic stress tolerance have previously been identified as regulating drought tolerance in rice (Bernier et al, 2007). However, although the major genes and pathways are well characterized for individual abiotic stresses in rice, namely drought and salinity, the common mechanisms regulating abiotic stress tolerance need to be understood with greater clarity in order to find novel solutions for achieving higher yield under rainfed, drought-prone conditions. In the present work, gene expression data from multiple abiotic stresses were analyzed using machine learning models to identify genes and pathways capable of classifying stress and non-stress (control) conditions in rice.