Illumina sequencing and genome assembly
The RNA-seq results of the three samples (root, leaf, and stem) showed a relatively large number of raw reads, over 120 million reads per sample, which ensures high-quality in-depth analysis. The proportion of high-quality (effective) reads was also very high, ranging from 98–99%. The error rate was very low, only 0.02%. The base accuracy rate was very good, with Q20 and Q30 scores > 94%. The GC content was balanced between 41% and 42%. Overall, these results indicate that the RNA-seq data is of high quality. The large data volume, high base accuracy, and high proportion of usable reads make it suitable for downstream analyses such as transcript assembly, gene expression analysis, and gene identification related to drought stress tolerance. High-quality reads with QC > 20 were selected, and low-quality reads and adaptor sequences were removed using a trimmer (Genbank: PRJNA1042660). The high-quality reads were assembled using Trinity and mapped back to the reference transcriptome assembly using HISAT2. Additionally, the reads were also mapped to the reference genome for transcript expression analysis of the genes of interest. The results of the transcriptome assembly quality assessment showed a high coverage of core genes, with 90.98% (complete) and 95.29% (complete + partial) coverage. This indicates that the transcriptome assembly is of high quality, reflecting the presence of the most important genes in the genome. The fragmented BUSCO gene ratio was low (4.71%), indicating that few genes were fragmented during the assembly process. The missing BUSCO gene ratio was also low (4.3%), indicating that the assembly covered most of the essential genes (Fig. 1). Overall, with the recovery of > 90% of important genes and good representation of the transcriptome features, this can be considered a high-quality assembly suitable for downstream gene analyses (Table 1).
Table 1
Transcriptomic assembly and annotation statistics.
# Contigs | 168045 |
# contigs (≥ 0 bp) | 173577 |
# contigs (≥ 1000 bp) | 46043 |
Total length (≥ 0 bp) | 131727957 |
Total length (≥ 1000 bp) | 77755015 |
# contigs (≥ 500 bp) | 82743 |
Largest contig | 8219 |
GC (%) | 39.64 |
N50 | 1484 |
# N’s per 100 kbp | 0.00 |
C:90,98%[S:33,3%,D:57,64%],F:4,7%,M:4,3% |
Total number of core genes queried | 255 |
Number of core genes detected Complete Complete + Partial | 232 (90.98%) 243 (95.29%) |
Number of missing core genes | 12 (4.71%) |
Average number of orthologs per core genes | 2.17 |
Transcriptome annotation
Using OmicsBox software with InterProscan, Blastp, GO mapping, and GO annotation tools, 71,230 unigenes from the transcriptome were annotated based on the NCBI, SwissProt databases, and Hayai-Annotation Plants (v. 2.0) databases. The results showed that 47,820 unigenes (nearly 68%) contained conserved motifs and domains, which helps in predicting protein function (Fig. 2). Thus, the analysis of the peanut transcriptome using OmicsBox has provided significant insights into the functions of most genes. This will be an important foundation for future gene functional studies and peanut variety improvement. The total number of GO annotations for the genome was 173,408. The average number of GO annotations at each level was relatively uniform, indicating the richness of functional information at different levels of GO. An analysis of the distribution of contigs annotated in different ranges, across the GO levels offers a possibility to learn about the accuracy of the annotation. The highest GO level was 15 and the lowest was 2, indicating a moderate level of detail (not too general or too broad). For biological processes, molecular functions, and cellular components, the main concentration was within levels 4–9, 3–7, and 3–7 respectively. This particular pattern of level-wise distribution for each GO category suggests that the process of annotation is highly accurate, and confident annotations can be made from this kind of annotation (Fig. 3). The results of the GO annotation analysis were able to provide comprehensive and detailed functional annotation of the peanut transcriptome.
Gene functional classification
The results of analyzing the top GO functional groups at level 3 in the three categories of BP, MF, and CC are depicted in Fig. 4. The BP category exhibits a wide range of different metabolic and regulatory processes, reflecting the diverse biological activities of peanuts. The most common GO term was “organic metabolism” with about 20,000 sequences; the least common term was “metabolic regulation” with about 5,000 sequences. The large number of GO terms related to metabolism and transformation is characteristic of plants in general, reflecting their robust metabolic activity. In addition, peanuts are rich in nutrients such as proteins, oils, and carbohydrates, which may explain the large number of processes involved in the metabolism of organic compounds and nutrients. The MF category focuses on catalytic, binding, and transport activities, suggesting the involvement of multiple proteins in essential functions. The highest frequency of the GO term was “catalytic activity, acting on L-amino acid peptides” with 15,000 sequences, and the lowest one was “ATP hydrolysis activity” with 5,000 sequences. The diversity in enzyme and protein complex activities contributes to the functional capacity of a genome. The CC group suggests basic elements of cellular organization, containing several instances of supramolecular complexes, cell organelles, as well as other structures. Among the annotations given by the sequences, the GO term with the highest occurrence was “intracellular anatomical structure”, accounting for about 20,000 sequences, and the one with the lowest occurrence was “extracellular region”, denoted only by 5,000 sequences. The identified number of sequences among these groups varied from 5,000 to 20,000, showing that the L14 peanut transcriptome data has a wide range of GO group diversity and abundance.
Identification of genes involved abiotic stress tolerance
A total of 71,230 predicted genes were analyzed using the KEGG database, of which 33,670 proteins were functionally identified in a relatively high proportion (Fig. 5). This indicated that the protein prediction and classification pipeline from gene sequences is effective. The group of proteins involved in genetic information processing (DNA and RNA) was the most abundant, accounting for 18.5%. This may reflect the important role of transcription and translation regulation mechanisms in peanuts. Proteins involved in carbohydrate and lipid metabolism and environmental response also accounted for a high proportion (8–10%), which showed that peanuts require a lot of proteins to regulate these processes for adaptation to environmental conditions during the flowering stage.
Besides, the results also showed a difference in the number of genes between peanut tissues (Fig. 6). The leaves had a number of predicted genes similar to the stem, with 52,972 and 52,967 genes, respectively, which was higher than the roots (44,367 genes). The leaves and stem had the most genes in common (11,197 genes), which could reflect the functional similarity between these two tissues. They have many interrelated metabolic processes, such as the transport of solutes from leaves to other parts. Additionally, the cellular composition and tissue structure of leaves and stems are very similar, especially the inner soft tissue. The similarity in the number of overlapping genes could also reflect the close functional relationship between leaves and stems. In contrast, the roots and stems had the fewest genes in common (4,453 genes), which showed a significant difference in function. Roots have a simpler cellular and tissue structure than stems and leaves. The cells are less differentiated, and there are fewer different types of cells, so there are fewer genes specific to cellular function. The results suggest that the leaves and stems have more functional similarities than the roots. In addition, genes involved in abiotic stress tolerance belonging to the LEAs (10 genes), PLC/D (14 genes), and AP2ERF families (89 genes) were identified and analyzed for transcriptional expression from transcriptome data. This information can be used to suggest future studies on gene function and peanut cultivar improvement.
Identification and classification of SSR markers
The SSR analysis of the transcriptome data showed a total of 173,577 transcripts with a total length of more than 131 million bp. Each transcript contained an average of 0.19 SSRs, representing 0.44% of its length. Among the most frequently observed SSR types were Mono > Tri > Di, with 39.3%, 31.7%, and 17.2%, respectively (Fig. 7A). The average length of SSRs is relatively short, ranging from 15 to 26 bp. Among them, the mono SSR type has the shortest length with the fewest repetitions compared to other SSR types. SSRs are molecular markers with high promise for application in gene mapping, genetic diversity analysis, or plant breeding support. The transcriptome dataset of the peanut L14 cultivar contains a large number of SSRs that can be used as molecular markers. Furthermore, all discovered markers were mapped onto the assembled transcriptome using the e-PCR algorithm from GMATA. The markers were mapped to 13,935 transcripts or contigs in the genome. The most common number of alleles (genotypes) was 2, with 5,762 markers (38.5%). This was followed by 3 alleles with 4,783 markers (31.9%). A total of 40,522 amplicons (gene fragments amplified by PCR) were predicted to be amplified from the mapped markers (Table 2). Each marker was predicted to amplify 2.7 amplicons (Fig. 7B) on average. The results of the in silico SSR marker mapping and polymorphism analysis provide valuable information for the development of genetic markers and the study of genetic diversity in peanuts. The large number of markers mapped to the transcriptome will be useful for a variety of applications, such as marker-assisted selection, germplasm characterization, and genome-wide association studies. The predicted amplicons can be used to design primers for PCR-based assays to validate the markers and study their genetic diversity.
Table 2
Summary of SSR marker and eMapping.
Total NO. of sequences with markers mapped to | 13935 |
Total markers mapped to input sequences | 14981 |
Total amplicons PCRed from mapped markers | 40522 |
Average amplicons per mapped marker | 2.70489 |
Gene expression analysis
The transcriptional expression of several genes involved in the abiotic stress response in the root, stem, and leaf tissues of peanuts was analyzed using a heat map based on FPKM values (fragment per kilobase million) from the transcriptome (Fig. 8). Transcriptional expression patterns of the LEAs (A), phospholipase C/D (PLC/D) (B), and APETALA2/ethylene (AP2/ERF) (C) gene families across plant tissues provide information on their role in abiotic resistance mechanisms. The LEA gene family, known for its ability to protect cellular structures and macromolecules from damage caused by dehydration (Battaglia et al. 2008; Hundertmark and Hincha 2008), expressed different levels of genes across the data analysis. The results showed that LEA5, LEA34, and LEA1 were expressed more in roots and leaves than in stems, indicating the participation of these genes in the above tissues in the process of water and nutrient absorption. Photosynthesis and transpiration are necessary for plants to carry out their basic metabolic functions during periods requiring water or abiotic stress factors (Liang et al. 2019; Wang et al. 2021). Additionally, the PLC and PLD gene families, involved in the production of signaling molecules such as inositol phosphate and phosphatidic acid, respectively, exhibited distinct expression patterns across transcriptome data. Specifically, genes PLDA1, PLDZ, PLC6, PLC2, and PLDB1 exhibited greater expression levels specifically within the roots. In contrast, several other genes, including PLCD4, PLC2, PLDD1, and PLDB1, displayed high transcriptional activity within the leaves, suggesting their potential roles in abiotic stress response mechanisms within these tissues. Research shows that high expression of PLC and PLD genes may contribute to the generation of signaling molecules that regulate various stress response pathways, including modulation of ion channels, transcriptional regulation of genes to respond to stress, and activation of various metabolic pathways to enhance drought tolerance (Tuteja and Sopory 2008; Bargmann et al. 2009; Wang et al. 2014). Moreover, heatmap analysis data revealed that the transcriptional expression of several genes belonging to the AP2/ERF family (such as ERF1, ERF2, DREB1F, DREB2C, and CBF1) in the roots and leaves of the L14 peanut cultivar was higher than that in the stems. These transcription factors are known to play important roles in regulating stress response pathways against various abiotic stresses, including drought, salinity, and low temperatures (Sakuma et al. 2002; Xu et al. 2011; Feng et al. 2020).
In addition, RT-PCR analysis provided a comparison of LEA5 transcript expression in the root tissue of drought-tolerant L14 peanut cultivars and local peanut cultivars under artificial drought conditions using PEG 6000 with different concentrations (Fig. 9). In the peanut L14 cultivar, it was shown that the regulation of LEA5 gene expression increased significantly when the PEG 6000 concentration increased from 5–30% and peaked at a concentration of 15% (2.5-fold higher than the control), showing its important role in protecting against dehydration and resisting osmotic pressure. The local cultivar showed similar induction of LEA5 gene expression at PEG 6000 concentrations of 15–20%, although at relatively lower expression levels. This differential expression suggests that the enhanced drought tolerance of cultivar L14 may be at least partly due to a stronger induction of the LEA5 gene.