Bioinformatics-based Screening of Key Genes for Saponin Metabolism in Quinoa

Quinoa saponins have complex, diverse and evident physiologic activities. However, the key regulatory genes for quinoa saponin metabolism are not yet well studied. The purpose of this study was to explore genes closely related to quinoa saponin metabolism. In this study, the signicantly differentially expressed genes in yellow quinoa were rstly screened based on RNA-seq technology. Then, the key genes for saponin metabolism were selected by gene set enrichment analysis (GSEA) and principal component analysis (PCA) statistical methods. Finally, the specicity of the key genes was veried by hierarchical clustering. The results of differential analysis showed that 1654 differentially expressed genes were achieved after pseudogenes deletion. Therein, there were 142 long non-coding genes and 1512 protein-coding genes. Based on GSEA analysis, 116 key candidate genes were found to be signicantly correlated with quinoa saponin metabolism. Through PCA dimension reduction analysis, 57 key genes were nally obtained. Hierarchical cluster analysis further demonstrated that these key genes can clearly separate the four groups of samples. The present results could provide references for the breeding of sweet quinoa and would be helpful for the rational utilization of quinoa saponins.


Introduction
Quinoa (Chenopodium quinoa) has been widely cultivated in South America for 5,000 to 7,000 years 1 .
Quinoa's nutritional features have attracted the attention of many scienti c researchers. Meanwhile, quinoa's medicinal value has been paid increasing attention and developed [2][3][4] . Quinoa saponins (QSs) is a bioactive molecule with anti-obesity 5 , anti-cancer 6 , immune response 7 , anti-diabetes 8 , cholesterollowering 9 , anti-in ammation 10 , anti-oxidation 11 and other medical effects. At the same time, it is a kind of anti-nutrition factor with a slightly bitter and astringent taste 12 , and potentially toxic 13 , which hinders the application of quinoa in food, processing and promotion. The removal, extraction and maximum utilization of quinoa saponins have been a di cult problem for researchers. Therefore, reasonable regulation of saponin content in quinoa products can greatly improve the quality of quinoa food and the e cacy of quinoa drugs, which are of great signi cance for development of quinoa resources. It is an important goal of quinoa breeding to develop the saponin-free quinoa with good taste and improve the bioactivity utilization rate of QSs.
In the past 30 years, the research on quinoa saponins mainly focuses on the separation, extraction and detection of saponins, structure elucidation and biological activity identi cation. For example, Verza et al. 7 extracted and puri ed quinoa saponins using hydrochloric acid ethanol solution, and used ultra-high performance liquidity-Quadrupole time-of-ight mass spectrometry (UPLC/Q-TOF-MS) qualitatively identi ed 10 saponins. And their humoral and cellular immune reactivity was con rmed in mice. Gee et al. 14 assessed the effect of processing technology on saponins content and composition in quinoa products, and studied the biological effects of quinoa grains and cereal products containing high saponins and low levo-cholesterol in vitro in rats. It was found that processed quinoa reduced saponin concentration and membrane solubility activity, and improved palatability and nutritional quality of cereals to levels similar to those of wheat-based cereals. León-Roque et al. 15 developed and veri ed a microtiter macrolens-coupled smart phone (MCS) assay for quantifying total saponins in quinoa, which is a green and fast analytical method. More and more saponins are extracted and their molecular structure is elucidated, which lays a foundation for the study of their metabolic mechanism. The discovery and elucidating structure of quinoa saponins has laid a foundation for the further study of the metabolic mechanism of quinoa saponins.
Although the saponin metabolic pathways are relatively clear in the other plants, the study of the key genes involved in the biosynthesis of saponins in quinoa is still in its infancy. The biosynthetic pathway of QSs and its related enzymes have not been fully elucidated 21 . In recent three years, with the development of high-throughput sequencing technology, the metabolic mechanism of quinoa saponins has been intensively studying at the molecular level. The research on key genes of QSs has a certain breakthrough. In 2016, Fiallos-Jurado et al. 21

Materials And Methods
Acquisition and collation of sequencing data.
The RNA-seq raw data for four groups of quinoa samples (yellow bitter quinoa seeds (Yr), yellow bitter quinoa ower (Yl), white sweet quinoa seeds (Wr), white sweet quinoa ower (Wl), respectively; three biological replicates for each group) are available in the NCBI Sequence Read Archive (SRA), with accession number: SRP226463. The quality control (QC) of the sequencing data has been performed using Fastp (v 0.19.1) to obtain clean data. Clean data were used for subsequent analysis. Quinoa genome data along with transcript and protein data were downloaded from https://www.ncbi.nlm.nih.gov/genome/?term= chenopodium + quinoa .

Differential expression analysis.
Prior to the analysis for nding differentially expressed genes (DEGs), clean reads were aligned to a quinoa reference genome by Hisat2 (v 2.1.0). Gene counts in quanti cation analysis were computed by featureCounts from Subread (v 2.0.1). Then differential gene expression analyses were conducted using DESeq (v 1.28.0) package (http://www-huber.embl.de/users/anders/ DESeq/). DEGs were identi ed based on a corrected p values (padj) < 0.05 and |log2 foldchange| >1 Gene set enrichment analysis.
Gene set enrichment analysis (GSEA) was carried out with GSEA software (version 1.2), which was obtained from GitHub (https://github.com/GSEA-MSigDB/GSEA_R). The outcome of the differential expression analysis was taken to perform gene set enrichment analysis using the 'preranked' running mode. All the genes in the genome were sorted according to the expression correlation of gene of interest and included in the gene set enrichment analysis. The pre-de ned gene set of saponin metabolism pathway is composed of four kind of key genes during saponin metabolism. Enrichment map was used for visualization of the GSEA results. We selected as signi cant only those enrichments (Normalized Enrichment Score, |NES| > 1) with a normalized P value < 0.05, false discovery rate (FDR) < 0.25 after gene set permutations were performed 1000 times for each analysis.
The prcomp function of stats (R package, version 3.6.1) were used for PCA analyses. PCA analysis was performed using differentially expressed genes for dimensionality reduction and to determine the contribution of genes to each principal component. The genes with |loading value| > 0.1 were selected as the nal key genes. Based on key genes, R software package pheatmap (https://CRAN.Rproject.org/package=pheatmap) was used to generate a hierarchical cluster heat map and cluster tree.
Gene enrichment analysis.
The Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis revealed the signi cantly enriched pathways of differentially expressed genes analyzed using the enricher function in R software package clusterPro ler (version 3.14.3) 24 . Gene Ontology (GO) term enrichment analysis was performed to provide all the GO terms that were signi cantly enriched in the differentially expressed genes using the enrichGO function in clusterPro ler package.

Study selection.
Brie y, six major steps were employed including data collection, data quality control, quanti cation, differential expression analysis, gene set enrichment analysis, principal component analysis. In the rst four steps, we selected the standard transcriptome sequencing analysis to obtain the DEGs. In the fth step, GSEA was selected to screen genes closely related to saponin metabolism, and the candidate key genes were obtained by taking the intersection with the DEGs. Finally, the key genes were obtained by PCA. The whole study process selection is described in the owchart (Fig. 1).
Statistics for the differentially expressed genes on the four groups.
We detected 4013, 19181, 20265 DEGs when comparing the gene expression levels between Yr and Wr groups, Yr and Yl, Yr and Wl respectively. Taking the intersection of three DEGs sets, a total of 1734 signi cant differentially expressed genes were obtained. The results of differential gene expression analysis were shown in Fig. 2. After further ltering out the pseudogenes, one resulted in a total of 1,654 DEGs.
Screening key candidate genes.
After differentially analyzing, we obtained DEGs in the seeds of yellow bitter quinoa. To further screen genes closely related to saponin metabolism, we carried out GSEA on all genes in the genome, aiming to screen the genes regulating saponin metabolism pathways. We selected four kind of saponin synthesis key genes as the pre-de ned gene set of saponin metabolism pathway. The detailed information of these genes was shown in Table S1. The correlation between gene of interest and other genes was used as the phenotype label. GSEA results showed that 5052 genes were signi cantly correlated with saponin metabolism. By taking the intersection of the gene sets and that from differential expression results, we obtained 116 key candidate genes (showed in Table S2). Among these, there were 11 long non-coding RNA and 105 protein-encoding genes. Therein, 18 genes are encoding two or more proteins. Fig. 3 lists the top six genes that are closely related to the saponin metabolism pathway based on Normalized Enrichment Score (NES). It can be seen that the genes in the saponin metabolism pathway are signi cantly enriched to the positive region of the rst ve genes (Fig. 3 a-e), while they are enriched to the negative region of the sixth gene (Fig. 3f).
Identi cation of key genes for saponin metabolism.
To select the key genes from key candidate genes, we performed principal component analysis on 116 key candidate genes (Fig. 4a). As can be seen from Fig. 4a that the rst component of the PCA explained 51.92%, while the second component explained 24.63%, of the variance. PC1 signi cantly distinguished the yellow quinoa seeds from the other three samples. PC2 can distinguished white sweet quinoa seeds from the other two samples (group Yl and Wl). Ultimately, 57 genes were selected as the nal key genes.
The details of component loadings for each gene are described in Table S3, while the scree plot of PCA is represented in Fig. S1. Details on the expression of 57 genes are presented in Fig. 5.
Hierarchical clustering of 12 samples according to the expression patterns of these 57 key genes suggested that these up-regulated and down-regulated genes could effectively discriminate Yr and the other three groups. The heatmap and hierarchical clustering were showed in Fig. 4b.

Discussion
As mentioned in the literature review 25,26 , saponins have high medicinal value. However, QSs have two sides. Despite medicinal value, on the other hand, its bitterness reduces the taste of food, which brings a lot of trouble to the production of quinoa food. To solve this dilemma, many studies have focused on the desaponi cation treatment of quinoa seeds in food production. With the development of high-throughput sequencing technology, particularly completion of quinoa genome annotation 23 , genomics-assisted breeding has gradually been put on the agenda for the improvement of quinoa varieties. This study was designed to mine the key genes related to QSs metabolism by RNA-seq technologies and several statistical methods. These ndings might help others to better understand the metabolic process and regulatory mechanism of QSs. The screened key genes can be used as promising targets for the molecular breeding of improved quinoa, which, in turn, would lay the foundation for the rational utilization of QSs.
The results of differential analysis revealed that there were more DEGs between developmental stages than between breeds. There were 19181 differential genes between the Yr group and Yl group (Fig. 2a); 20265 genes between the Yr group and Wl group (Fig. 2c); 4013 genes between the Yr group and Wr group (Fig. 2b). This may be because in addition to the differential expression of metabolic genes, the differences in the expression of structural genes were also highly signi cant than that between breeds.
Through the intersection of three differentially expressed gene sets, most of the differentially expressed structural genes were eliminated, while the differentially expressed metabolic genes were maximally retained. We also carried out KEGG pathway analysis of the screened DEGs and found these genes were signi cantly enriched in metabolism-related pathways (Fig. S2), which con rmed our speculation.
Gene set enrichment analysis is useful for identifying genes closely related to saponin synthesis. We obtained 116 key candidate genes by performing GSEA. Among 116 genes, 56 genes were up-regulated and 60 were down-regulated between Yr group and Wr group (Table S2). Remarkably, we found that the majority of saponin metabolic pathway genes in Yr group compared to Wr group were down-regulated (Table S1), which was contrary to the up-regulation of saponin content. Consequently, we speculate that there are new metabolic pathways for saponin in yellow quinoa seeds in addition to the metabolic pathway to our knowledge. In 57 key genes analyzed by PCA, we found 22 genes encoding uncharacterized proteins closely related to the metabolism of quinoa saponin (marked by asterisks in Fig. 5), which partially corroborates our speculation.
In order to select key genes from 116 genes, principal component analysis was carried out (Fig. 4a). We ltered 57 key genes using PCA. Details about the expression of these genes are displayed in Fig. 4b and Fig. 5. As can be seen, of the 57 selected genes, 51 genes were up-regulated and 6 genes were downregulated between Yr and Wr group. We speculated that these up-regulated genes were positive regulators of saponin biosynthesis, while down-regulated genes may play a role in negative regulators.
To identify key genes, hierarchical cluster analysis and PCA analysis were performed based on 57 genes. Hierarchical clustering of 12 quinoa samples according to the expression patterns of 57 key genes suggested that these genes could effectively discriminate four groups (Fig. 4b). PCA analysis showed that the expression of 57 key genes explained 84.19% of the variable (Fig. S3), which better explained the differences of the four groups of samples than that based on 116 genes (Fig. 4a).
There are still some weaknesses of this work. Firstly, we have not identi ed the 57 key genes in an experimental way. Secondly, in this study, one interesting nding was we also identi ed 10 lncRNAs. Previous studies have been suggested that, in plants, lncRNAs play important roles as regulators in gene silencing 27 , owering time control 28 , organogenesis in roots 29 , photomorphogenesis in seedlings 30 , abiotic stress responses 31,32 , and reproduction 33 . What role these 10 lncRNAs play in the regulation of saponin metabolism that we need to explore further. Thirdly, in this study, due to incomplete information on genome annotation, we found some uncharacterized genes (Table S1 and Fig. 5), whose speci c functions need to be further veri ed in the saponin biosynthesis process.

Conclusion
The main goal of the current study was to mine the genes associated closely with quinoa saponin metabolism. Based on differential expression analysis between groups using RNA-seq technology, 1734 differentially expressed genes were obtained. subsequently, 116 genes strongly related to saponin metabolism were screened using GSEA. Finally, 57 key genes were obtained as a result of PCA dimension reduction analysis. Though several limitations need to be acknowledged, this pilot study is the rst genome-wide screening of quinoa saponin metabolism key genes. The ndings from this study might help others to better understand the metabolic mechanism of saponins in quinoa and provide a reference for the breeding of low saponin quinoa varieties. With regard to the outstanding issues in the study, further research will be undertaken in the following work. Figure 1 Flowchart of screening for the key genes associated with the metabolism of quinoa saponins. QC: quality control; DEGs: differential expression genes.