Chloroplast genome skimming of a potential agroforestry species Melia dubia. Cav and its comparative phylogenetic analysis with major Meliaceae members

Melia dubia Cav. is a fast-growing plywood species gaining popularity due to high economic returns. This study aimed to assemble and annotate the chloroplast (cp) genome of M. dubia and compare it with previously published cp genomes within the Meliaceae family. The chloroplast genome was constructed by the de novo and reference-based assembly of paired-end reads generated by long-read sequencing of genomic DNA. The cp genome, sized 171,956 bp, comprised a typical angiosperm quadripartite structure. The large single-copy (LSC) region of 76,055 bp and a small single-copy (SSC) region of 18,693 bp cover 55% of the genome. The pair of inverted repeats (IRA and IRB) were 38,604 bp each (covering 45% of the genome). We identified unique genes (112), including protein-coding genes (79), tRNA (29) and 4 rRNA genes. Phylogenetic analysis using complete cp genomes of 11 species from Meliaceae revealed that M. dubia and M. azedarach shared a sister clade. Comparative analysis using cp genome of M. dubia, M. azedarach and Azadirachta indica revealed a high sequence similarity (> 70%). Five intergenic regions were highly conserved among the three cp genomes. The gene trnG-UCC at LSC region was found to be more divergent in M. dubia and M. azedarach, while it shows complete conservation within M. dubia and A. indica. This is the first report of the chloroplast genome in M. dubia. The available levels of taxonomic expertise and clarity in species delineation within the Melia genus are low. The information generated provides scope for identifying new barcodes which increases the discriminatory power of the species within the genus beyond morphological identification.


Introduction
Chloroplasts (cp) are plastids characterized by a doublelayered membrane and having independent DNA and thylakoid structures (Mustardy et al. 2008). They evolved through endosymbiosis between a photosynthetic bacterium and a non-photosynthetic host (Guo et al. 2022), retaining their distinct genome information (Howe et al. 2003). This intracellular organelle is the site of photosynthesis, providing energy for plants and algae and supporting biosynthesis of primary metabolites. A typical angiosperm cp genome is quadripartite comprising a large single copy (LSC) and small single-copy region (SSC) separated by a pair of inverted repeats (IRs) (Li et al. 2017;Liu et al. 2022. The number of cp genomes reported and deposited in the National Center for Biotechnology Information (NCBI) database is rising.
The plastid is non-recombinative and displays uniparental inheritance (Daniell et al. 2016). Compared with nuclear and mitochondrial genomes, cp genomes are the most conserved in DNA sequences, organization, and structure. Hence, they are widely used in phylogenetic analysis and molecular taxonomy.
Mader et al. report complete cp genomes of four Meliaceae species (Order: Sapindales) and their comparison with neem (Azadirachta indica), the first published Meliaceae plastome sequence (Mader et al. 2018). Presently chloroplast DNA (cpDNA) sequences are available for Melia The complete Chloroplast genome sequence of Melia dubia was submitted to the third party annotation section of the GenBank databases at the National Centre for Biotechnology Information (NCBI) under the accession number TPA: BK059531.
cp genome is usually constructed by direct isolation of the cpDNA (Baek et al. 2021) or cp genome amplification using PCR from total DNA. However, neither method provides sufficient yield and quantity for all the plant species (Twyford et al. 2017). Since whole cellular DNA consists of chloroplast, mitochondrial and chromosomal DNA, whole genome sequence (WGS) data are used to construct the cp genome directly (Freudenthal et al. 2022).
Since the first genome sequencing in 1986 (Ohyama et al. 1986;Shinozaki et al. 1986), different strategies and various bioinformatics tools have been developed to assemble cp genomes (Twyford et al. 2017). Deducing a cp genome from the WGS data requires extracting cp reads from the WGS followed by assembly and resolution of the unique circular structure, including the inverted repeats. In the absence of a reference cp to map the reads, a k-mer analysis is used to extract the most frequent reads. One such toolkit GetOrganelle (https:// github. com/ Kingg erm/ GetOr ganel le) is available in the public domain to extract cp sequence data. This toolkit applies the "baiting and iterative mapping" approach to recruit organelle-associated reads (Jin et al. 2020).
CPGAVAS 2.0 is an integrated web server for annotating, visualizing, analysis and GenBank submission of cp genome sequence. The software is freely accessible from http:// www. herba lgeno mics. org/ cpgav as2/. Based on the identification and mapping of the similar protein, CPGAVAS predicts protein-coding, rRNA and cDNA sequences by integrating results from BLASTx, BLASTn, protein2genome, est2genome programs. Further, tRNA genes and inverted repeats are identified using tRNAscan, ARAGORN and vmatch. It also generates a circular map (Liu et al. 2012).
Melia dubia Cav. (Family-Meliaceae) is a deciduous fast-growing multipurpose species distributed in India, Sri Lanka, Myanmar, Malaysia, Java, China and Australia. It has high-quality termite and fungus-resistant timber, making it a money-spinning species for farmers. It is widely used in agro and farm forestry plantation programmes. Melia dubia has the potential to be developed into biofuel because of its high photosynthesis efficiency and high biomass yield. The species has received considerable attention from researchers due to its short rotation, high coppicing ability and better economic returns (Geetha et al. 2019).
The species is often confused with Melia azedarach, belonging to the same family (Sivaraj et al. 2018). Though the Plants Of The World Online (https:// powo. scien ce. kew. org/) identifies them as separate species, the Plantlist (http:// www. thepl antli st. org/), reports Melia dubia as a synonym of M. azedarach. Reports are available on the use of barcodes to delineate them. However, there is little genomic information reported. Hence in this present study, we assembled the complete cp sequence of Melia dubia using whole-genome data to develop new markers for delineation. This is the first report on the cp genome information of the species.
Further annotations revealed its gene content and a circular map was constructed using CPGAVAS2 (Liu et al. 2012). The results also provide insights into the systematic classification. Secondly, a comparative analysis of Meliaceae species was conducted. Finally, highly variable and conserved genes of the M. dubia cp genome were identified compared to the M. azedarach and A. indica. The complete cp genome sequence would help to elucidate evolutionary and phylogenetic relationships in the other members of Meliaceae. The information would also provide comprehensive information relevant to the identification and scope for genetic improvement in the species.

Whole genome data
The whole genome sequence raw data of M. dubia was downloaded from NCBI SRA database (GenBank accession: SRR5576232) using fastq-dump tool of SRA-toolkit in two forward and reverse reads. The reads were then quality-checked using FastQC (version 0.11.9) tool. Lowquality reads -phred score < 30 and adapter sequences were removed using Trimmomatic v0.36 (Bolger et al. 2014) according to the following trimming parameters AVGQUAL and MINLEN with the threshold values 25 and 20, respectively. Cleaned reads were further taken for cp genome generation and assembly.

Chloroplast genome assembly and annotation
The de novo assembly of M. dubia cp genome from the whole genome sequence of M. dubia was carried out using GetOrganelle v1.7.5 toolkit. The generated complete cp sequence was annotated using CPGAVAS2− an integrated plastome annotator (Lu et al. 2022) and analyser http:// www. herba lgeno mics. org/ cpgav as2/, followed by manual correction using BLAST. The complete cp sequence generated was submitted to NCBI Third Party Annotation database under the accession number TPA: BK059531. A circular map was generated by CPGAVAS using the annotated genome.

Relative synonymous codon usage (RSCU) analysis
To determine the evolutionary diversity of specific genes the codon usage pattern was computed from the protein coding gene sequence of M. dubia cp genome. Compute codon usage bias from MEGA X software was employed for RSCU and frequency analysis (Thiel et al. 2003;Song et al. 2019;Jiang et al. 2022)

Phylogenetic analysis
The complete cp genome sequence of Melia dubia generated in the study was used. Twelve other species of the Meliaceae (Order Sapindales) and one species of adjacent order Geraniales were downloaded from NCBI database (Supplementary Table 1). The fourteen complete cp sequences were aligned using MAFFT v7.4.0.9 (https:// mafft. cbrc. jp/ align ment/ softw are/ index. html) (Li et al 2021;Wu et al. 2021;Zhu et al. 2022 with the parameters set to FFT-NS-2 (Katoh 2013). The aligned sequences were further trimmed to equal length using BioEdit v7.2.5. Maximum likelihood method was followed to infer the phylogenetic relationship with 1000 bootstrap replicated in MEGA 10.2.4 and a phylogenetic tree was generated (Kumar et al. 2016).

Comparative genomics of M. dubia cp genome
The newly sequenced cp genome of M. dubia was compared with two other closely related species namely M. azedarach and A. indica, respectively (NC_050650.1 and NC_023792.1). Multiple alignments were carried out using mVISTA programme, and sequence identities were plotted using shuffle LAGAN mode (Frazer et al. 2004;Ruang-Areerate et al. 2021. The annotation of M.dubia was set as reference.

General features of M. dubia cp
The complete cp sequence of M. dubia retrieved from WGS data (GenBank accession: SRR5576232) through de novo assembly using GetOrganelle toolkit was identified and submitted to third party annotation section of the GenBank databases under the accession number TPA: BK059531. A total of 2,89,24,426 raw reads with average read length of 150bp, was generated which on cleaning provided 2,72,94,460 processed reads (94.36% of raw reads).
The cp genome of M. dubia is a small, double-stranded circular structure with a genome size of 171,956 bp (Fig. 1). It exhibits a typical angiosperm quadripartite structure with a pair of inverted repeats (IRA and IRB) of the same size of 38,604 bp; accounting for 45% of the genome separated by a large (LSC) and a small (SSC) single copy regions of 76,055 bp and 18693 bp with coverage of 44% and 11%, respectively. The frequencies of all the four bases were 53,837,54,192,32,596 and 31,331 (Supplementary Fig . 1). G+C content of all land plants cp genome plays a significant role in its own structural evolution. The universal percentage of guanine and cytosine bases (GC%) of M. dubia cp was found to be 37.18%, with uneven distribution along the single copy and inverted repeat regions. GC % was low at SSC region (31.19%), moderate at LSC region (35.23%) and high at IRs (40.55%) ( Table 1).
GC% accounted for 40.55% in the IR regions, while in the LSC and SSC regions, GC% accounted for 35.23 and 31.19 %, respectively. In contrast with the LSC and SSC regions, IR regions had higher per cents of GC % (Table 2).

Chloroplast genome annotation and contents
A total of 146 genes were annotated using a web-based programme called CPGAVAS2− an integrated Plastome Annotator and Analyzer and validated using BLAST X. Melia dubia cp encodes 146 genes-112 are unique; they include 79 coding sequences (CDS), 29 transfer RNAs (tRNAs) and 4 ribosomal RNAs (rRNAs). 34 genes were in duplicate condition, thereby giving a total of 146 genes altogether that code for 101 CDS, 37 tRNAs and 8 rRNAs. For the 34 genes in duplicates, 22 were CDS, 8 were tRNAs and 4 were rRNAs (Fig. 2).
The protein coding genes of this cp was mainly classified into three categories (1) Genes for photosynthesis that contain 52 genes which include subunits of ATP synthase genes, photosystem I genes, photosystem II genes, NADH-dehydrogenase (2) Genes for self-replication that enclose 38 genes including large subunit of ribosome, DNA dependent RNA polymerase, small subunit of ribosome.   (Table 3).
Sixteen genes were in duplicates from CDS genes and found in inverted repeat regions. As with other land plants rps 12 in M. dubia is also a trans-splicing gene with its 5′ end located in LSC and two replicative 3′ ends at IRs. Fiftyone coding genes were in single copy and the remaining 16 were split genes. Twenty four genes in M. dubia cp harbours introns of which 2 genes (ycf3, clpP) harbour a second intron. The intron size ranged from 2544 bp for trnK-UUU to 527 bp for trnL-UAA (Supplementary table 2). Among the 37 tRNAs, 6 were duplicate genes, 17 genes in single copy and 8 were split genes. Moreover, there were 8 ribosomal RNA genes in duplicate conditions located at IR regions.

Codon usage patterns
The codon usage pattern of the chloroplast genome is very useful for studying the degree of molecular adaptation and diversity during evolution. Melia dubia cp uses 64 different codons that code for 21 different types of amino acids. The number of codons ranged from 84 to 2257. Methionine and tryptophan amino acids possess only one codon while the remaining amino acids hold 2-6 codons (Fig. 3).

Analysis of SSRs in M. dubia chloroplast
Identifying the diversity in species using molecular markers and its development can facilitate analysis of population genetics, species identification and polymorphism studies in Melia dubia. Using MISA, we detected 81 SSR loci in Melia dubia cp genome, which includes 97.5% mononucleotides and 2.5% dinucleotides ( Supplementary Fig. 2). The predominant SSR motifs were thymine (T) and adenine (A) with an average frequency of 51 and 44%. This abundance contributes to the bias in AT rich sequences than GC in cp genome. Dinucleotides were completely composed of either AT or TA. No other nucleotides were present in M. dubia.

Repeat structure and SNP
Repeat motifs play crucial role in genome rearrangement and phylogenetic analysis. The REPuter results for the cp genome of length 171,956 bp with maximum distance 3 and minimum repeat size 30 bp revealed 3 categories: forward (23), reverse (1), palindromic (26) repeats. The TRF tool identified sixty-five long tandem repeats with repeat unit size > 6 bp (Supplementary Tables 3, 4). SNP loci are valuable resources in species determination and population structure analysis. Using M. azedarch chloroplast genome as a reference SNP analysis was carried out. The result revealed 1014 SNPs in M. dubia. Six hundred thirty-four and 220 SNP markers were present in LSC and SSC regions, respectively whereas SNPs in IRs were less compared to other two  psbA,psbB*,psbC,psbD,psbE,psbF,psbH*,psbI,psbJ,psbK,psbL,psbM,psbN*,psbT*,psbZ,ndhB*,ndhC,ndhD,ndhE,ndhF,ndhG,ndhH,ndhI,ndhJ,  regions indicating IRs high degree of conservation (Supplementary Tables 5, 6). Out of 24 intron harbouring genes, 17 genes detected with 182 SNP markers. The two stretches of the coding region within the chloroplast that make up the genes rbcL and matK are the important DNA barcode markers for terrestrial plants. So these genes were extracted from chloroplast genome and SNPs were mined. There were 18 and 8 SNP loci in matK and rbcL genes respectively (Supplementary Table 6).

Phylogenetic relationship analysis
The complete cp genome of 12 Meliaceae members and Melianthus villosus outgroup species belonging to Geraniales order were downloaded from NCBI database and aligned along with the constructed cp genome of Melia dubia (Supplementary Table 1). Phylogeny using cp genomes was resolved and determined with 100% bootstrap in almost all nodes (Fig. 4). A high resolution on relationships was demonstrated in phylogenetic tree by dividing the Meliaceae members mainly into two independent branches representing Subfamilies Melioideae and Cedreloideae. In the Tribe Melieae, Azadirachta indica A. Juss. made a basal branch and cp genome of Melia dubia. Cav., Melia azedarach L. occurred as sister clades with a common internal node showing that they share a common ancestor. Aglaia odorata Lour. and Aphanamixis polystachya (Wall.) R. Parker was found to be closely related species within the tribe Aglaileae.
In the tribe Cedreleae, Cedrela odorata L. was the initially diverged lineage over the sister clade Toona ciliata M.Roem. and Toona sinensis (Juss.) M.Roem (Muellner et al. 2009  and also clustered as separate terminal branch from Meliaceae during the tree construction. This is in agreement with reports from other researchers who showed similar grouping amongst Meliaceae members (Mader et al. 2018).

Comparative genomics analysis
Comparative analysis using cp genome of M. dubia, M. azedarach and A. indica was performed using mVISTA. The cp genomes had a relatively high degree of conservation in exon regions than the Conserved non coding sequence (CNS) and introns. The sequence similarity was 72%. Trans splicing gene rps 12 was found to be conserved in 3 plastomes. trnG-UCC gene at LSC region was found to be more divergent in M. dubia and M. azedarach, whereas shows complete conservation within M. dubia and A. indica. rpoC2-rpoC1, petL-petG, psbN-psbH, rpl23-trnI-CAU, and psbH-petB were the intergenic regions which are highly conserved among 3 cp genomes (Fig. 5). Based on pairwise alignments using mVISTA, large deletions were identified for M. azedarach at the intergenic region between ndhB and trnL-CAA (Fig. 5). This deletion is a size of about 1587 bp ( Fig. 5; at position 148322-149905 of the Melia azedarach sequence).

Discussion
In this study, the complete cp genome of Melia dubia was sequenced and annotated ( Fig. 1, Table 1) and compared with previously published Meliaceae cp sequences, using the sequences of nine other species. Across eight other families within the Sapindales, more than 50 complete cpDNA sequences are available at the Organellar Genome Resource at NCBI (NCBI, 2022). There is a need to enrich the database with members from Sapindales to better understand the evolution of different families in this Order. The GC (guanine +cytosine) content accounted for 40.55% in the IR regions. In contrast, in the LSC and SSC regions, GC contents accounted for 35.23 and 31.19%, respectively. In contrast with the LSC and SSC regions, IR regions had a higher percent of GC content. The high GC contents in the IR regions could be attributed to DNA leading and lagging strands, origin of replication and termination, which are always highly conserved regions (Saina et al. 2018).
The genes of chloroplast mainly work for photosynthesis and self-replication. In addition, there are six genes with other functions and 4 with unknown functions (Table 2). We found clpP and ycf3 in duplicate, with two introns, similar to reports in cp genomes of four Meliaceae species (Mader et al. 2018)

Codon usage bias
Using the coding sequences, we calculated the RSCU frequency of the cp genomes of M. dubia, M. azedarach and A. indica (Supplementary Fig. 3). The total protein-coding genes had 54251 codons in M. dubia, whereas 53464 and 53579 codons in M. azedarach and A. indica, respectively. The most abundant and least prevalent amino acids were found to be isoleucine and cysteine in M. dubia. Isoleucine had 4592 codons, making up 8.46% of the total. Contrastingly, 6042 and 5871 codons of glycine made up 11.2 and 10.90% of the total in A. indica and M. azedarach. cysteine, the least occurring in M. dubia accounted for 0.9% of the cp genome. In case of in A. indica and M. azedarach, phenyl alanine had the least count (469 and 452 codons corresponding to 0.8% of the total in both species). A common feature observed was that almost all RSCU values greater than one were in the A/U-ending codons, while all RSCU values less than one were in the C/G-ending codons in all the three species analysed in this study.

SSRs, SNPs and repeat analysis
Our study analysed the SSRs in the cp genome of M. dubia ( Supplementary Fig. 2). We detected only 81 SSR loci. The mono-nucleotide A and T repeats accounted for 95%, the highest proportion in all SSRs identified. These markers have high intraspecific variability and could help resolve species delineation issues. The general observation is that plastid SSRs are usually made up of A and T and rarely contain G or C repeats. Xie et al. attribute this that A-T transformation is simpler than G-C transformation in cp genomes (Xie et al. 2018). Hebert et al introduced the concept of DNA barcoding in 2003. Research's on the selection of standard markers as DNA barcodes has been increasing since the concept introduction. matK and rbcL are chloroplast encoded genes which are considered as potential markers for land plants. There were 18 and 8 SNP loci in matK and rbcL genes respectively (Supplementary Table 6). Tic214 protein, second largest in chloroplast genome essential for the plant viability is encoded by the gene ycf 1. ycf 1 gene has been recently considered as significant DNA barcode. The highest number of SNPs among the protein coding genes was found in ycf 1, i.e. 74 ( Supplementary Fig. 4).

Comparative analysis of chloroplast genomes and phylogenetic analyses
Complete cp sequences help decipher phylogenetic relationships, especially between closely related taxa, or in cases where limited sequence variation exists due to recent divergence, rapid speciation or slow genome evolution (Tonti-Filippini et al. 2017;Williams et al. 2016) 1 3 30 Page 10 of 12 Regions with the highest diversity between the species are marked by red boxes, while highly conserved region is represented by blue box Considering that most species-level diversity of Meliaceae is contemporary , a whole-plastome phylogeny provides insights into the patterns of evolution and delineation. The phylogeny that was generated in this study (Fig. 4) based on the whole-cp alignment for three Meliaceae species (Fig. 5), is an initial step in this direction. Here, a clear grouping was also observed between M. dubia and M. azedarach which was grouped as sister clades, revealing that the two species are distinct. Azadirachta indica, though a member of the subfamily Melioidaeae, formed a separate clade from the genus Melia. This, again is in agreement with previous studies in Meliaceae (Koenn et al. 2015;Muellner et al. 2009). This study, therefore, helps in the differentiation at lower taxonomic levels. Whole-genome alignments are also very useful to develop cpDNA markers for the genetic identification of species or other taxonomic categories (Kersten et al. 2016).
The large deletion ( Fig. 5; at position 148322-149905) in M. azedarach sequence mainly contributes to the smaller genome sizes of M. azedarach over M. dubia. Such deletions could be developed into effective species-specific markers, which must then be further validated with more species and individuals within the family. The SSRs were identified (Supplementary Table 4), after further validation, could serve as targets for future marker development to delineate M. dubia from M. azedarach. PCR-marker development based on indels in small intergenic regions might be particularly robust, because the primers could be placed into conserved genic regions adjacent to the indel. Similarly, the trnG-UCC could serve as an excellent marker.
Studies focusing on cp genome variation among the Mahogany family, Meliaceae have provided a useful taxonomic framework for this study. We have added a cp genome reference for M. dubia to the growing collection of Meliaceae cp sequences. This is the first report of cp genome of M. dubia.

Conclusion
This genomic resource for M. dubia and the Meliaceae family will enable further detailed genetic study of Melia, a genus that contains three species. This would form the base to generate resources to provide a level of resolution that can support delineation of the species within the genus and further development of genetic tools for breeding in tree improvement, considering its potential as an agroforestry species.
Author contributions All authors contributed to the study conception and design. AS and KE contributed equally to material preparation, data collection and analysis. SS and KSK provided technical support during the initial phases of analysis. RRW involved in conceptualisation, resources, supervision, funding acquisition, writing original draft and editing. All authors read and approved the final manuscript.
Funding The study was funded by Indian Council of Forestry Research and Education.
Data availability All data generated or analyzed during this study are included in this published article (and its supplementary information files). The complete Chloroplast genome sequence of Melia dubia was submitted to the third party annotation section of the GenBank databases at the National Centre for Biotechnology Information (NCBI) under the accession number TPA: BK059531.