The advent of next-generation sequencing has made it possible to sequence the genomes and transcriptomes of a wide range of insects. However, one of the main challenges in transcriptome analysis is the biological interpretation of these sequences [11]. For many non-model species that lack genomic reference information, transcriptome sequencing is an effective alternative for gaining insight into the information content of the genome. Transcriptome assemblies have already been reported for various organisms, including Anthonomus grandis [12], Plutella xylostella [13], Rhopalosiphum padi [14], Ipomoea batatas [15], Eucalyptus grandis [16], Acropora millepora [17], Bemisia tabaci [18], and Dialeurodes citri [19]. To date, however, no study has reported a Hellula undalis transcriptome generated by next-generation sequencing.
In this study, the cDNA of Hellula undalis was sequenced on the Illumina NovaSeq 6000 platform, yielding a total of 48 million raw reads of 2 × 150 bp. After removing low-quality and contaminated reads, as well as reads containing more than 2% unknown bases, approximately 97% of the raw data remained. This filtering step was crucial to ensure the accuracy of the subsequent sequence assembly. The average GC content of the raw reads was 51.36%. The processed reads were then assembled de novo using the CLC Genomics Workbench, generating a total of 30,451 contigs. Of these, 21,806 contigs showed significant similarity to sequences in the NR db v5. The top BLAST hits in the NR db were with Spodoptera litura.
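The unknown-base filter and the GC-content calculation described above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: only the 2% threshold comes from the text, the function names and example reads are hypothetical, and the quality and contamination filters are not shown.

```python
def unknown_fraction(seq: str) -> float:
    """Fraction of bases in a read that are uncalled (N)."""
    if not seq:
        return 1.0  # treat an empty read as entirely unknown so it is dropped
    return seq.upper().count("N") / len(seq)


def keep_read(seq: str, max_unknown: float = 0.02) -> bool:
    """Keep a read only if at most max_unknown of its bases are N."""
    return unknown_fraction(seq) <= max_unknown


def gc_percent(reads) -> float:
    """GC content (%) computed over all called bases (A, C, G, T)."""
    gc = at = 0
    for seq in reads:
        s = seq.upper()
        gc += s.count("G") + s.count("C")
        at += s.count("A") + s.count("T")
    total = gc + at
    return 100.0 * gc / total if total else 0.0


# Hypothetical toy reads; the second one is 40% N and is discarded.
reads = ["ACGTACGTAC", "NNNNACGTAC", "GGCCGGCCAT"]
kept = [r for r in reads if keep_read(r)]
gc = gc_percent(kept)
```

For paired-end data, a pair would normally be discarded if either mate fails the filter. That choice is an assumption here, because the text does not state it.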
The distribution of transcripts annotated by Gene Ontology (GO) is similar to that reported in transcriptome studies of other insects [6], where cell, cell part, cellular process, and metabolic process were among the predominant GO classes. In our study, the GO mapping and annotation results showed that contigs associated with molecular functions were the most prevalent, followed by contigs associated with biological processes and cellular components. The COG database comprises proteins encoded in the complete genomes of bacteria, archaea, and eukaryotes [20]. COG classifies proteins by their evolutionary relationships, on the assumption that each protein in the database can be traced back to a common ancestor through vertical evolution, known as the orthology concept [21]. In this study, the COG database assigned 9,646 contigs (31.67%) to 21 functional classes. The largest proportion of contigs fell into the "Signal transduction" class, followed by the "Post-translational modification, protein turnover, chaperone functions", "Transcription", "Carbohydrate metabolism and transport", "RNA processing and modification", "Lipid metabolism", "Intracellular trafficking and secretion", and "Translation" classes. In addition, 123 pathways were identified from the transcriptomic data; these were associated with metabolism, genetic information processing, environmental information processing, cellular processes, and organismal systems.
The assembled transcripts were analyzed against the KEGG pathway database, a collection of graphical maps depicting metabolic and regulatory pathways. These pathways capture interactions between enzymes, proteins, and genes, so that complex functions can be understood by examining interconnected molecular networks at the systems level [22]. This approach facilitates the interpretation of broader biological processes. In our study, contigs associated with metabolism were the most prevalent, followed by environmental information processing, genetic information processing, organismal systems, and cellular processes (Fig. 5). Consistently, within the metabolism, genetic information processing, and environmental information processing categories, the largest numbers of genes were associated with global and overview maps, transcription, and signal transduction, respectively.
Several conventional approaches are used to develop microsatellite loci, including the hybrid capture method [23], selection of loci from existing genetic or genomic data [24], and transfer of loci from closely related species [25]. Compared with these traditional methods, de novo transcriptome sequencing offers a faster, more cost-effective, and reliable way to develop microsatellite (SSR) markers directly from transcriptome sequences, which is particularly advantageous for non-model species [8, 26–30]. A total of 1,913 potential SSRs were detected in the contigs. Of these, 129 were classified as compound SSRs, indicating more complex repeats, and 203 contigs contained more than one SSR. In total, 31 distinct sequence motifs were identified among the SSRs. The mononucleotide repeats A and T were the most abundant, occurring 619 times. Among the dinucleotide repeats, AT and TA were the most frequent, appearing 80 times. The trinucleotide repeats AAT and ATT were found 76 times, and the tetranucleotide repeats ACAT/ATGT were found 10 times (Fig. 7). These findings provide insight into the repetitive DNA sequences present in Hellula undalis, which can serve as genetic markers in further genetic enhancement studies.
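To make the SSR screening step concrete, the following Python sketch shows one common way to locate perfect microsatellites in a contig with regular-expression backreferences. It is an illustration only, not the tool or settings used in this study: the minimum repeat counts in MIN_REPEATS, the function name find_ssrs, and the example sequence are all assumptions, and compound SSRs are not handled.

```python
import re

# Hypothetical minimum repeat counts per motif length (not taken from the study).
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs."""
    seq = seq.upper()
    hits = []
    for size in sorted(min_repeats):
        n = min_repeats[size]
        # A motif of `size` bases followed by at least n - 1 further copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (size, n - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are repeats of a shorter unit, e.g. "AA" or "ATAT",
            # so a homopolymer is not also reported as a dinucleotide repeat.
            if any(size % k == 0 and motif == motif[:k] * (size // k)
                   for k in range(1, size)):
                continue
            hits.append((m.start(), motif, len(m.group(0)) // size))
    return sorted(hits)


# Hypothetical contig fragment containing one mono-, one di-, and one trinucleotide SSR.
contig = "GGC" + "A" * 12 + "CGC" + "AT" * 7 + "GCG" + "AAT" * 5 + "C"
ssrs = find_ssrs(contig)
```

Counts such as "A and T" or "ACAT/ATGT" combine a motif with its reverse complement. A full screen would therefore also merge complementary motifs before tallying, which this sketch leaves out.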