Insilco Analysis of Leishmania donovani Genome to Understand its Gene Expression Regulation

Leishmania donovani is an obligate intracellular parasite of human that causes a dreadful visceral infection that may involve liver, spleen, bone marrow, and lymph nodes. The specic feature of the species inuences its genetics. Changes in parasite adaptation to different environment, development of silent genes and parasite progression through its lifecycle may be determined by transcriptional organizations affecting its gene expression. Hence, Insilco analysis of Leishmania donovani BPK282A1 genome was performed to predict the promoter and regulatory elements involved in gene expression. Forty putative genes of Leishmania donovani BPK282A1 genes were obtained from National Center for Biotechnology Information and investigated using bioinformatic tools, such as Neural Network Promoter Prediction, MEME suite, CpG island nder, GOMo and CLC Genomics Workbench. In this analysis, the number of Transcription Start Sites (TSSs) in promoter regions of Leishmania donovani BPK282A1 varied greatly between 1 and 4 with majority (60%) of them having two or more TSSs. The study also revealed seven candidate motifs in the promoter regions of the gene. The common promoter motif for all Leishmania donovani BPK282A1 genes was MI with an E-value of 4.0e-116. Six of the identied motifs were related to zinc nger proteins that bind to the GC-rich regions in DNA. The study also revealed that most of both the promoter and the gene body regions contain at least one CpG island. Restriction analysis using MspI enzyme showed that Leishmania donovani BPK282A1 genes were CpG rich. In conclusion, the investigation unfolded expression regulation of Leishmania donovani BPK282A1 genes contributing to development of a robust drug and vaccine that targets the regulatory regions.


Introduction
Leishmania spp. are medically important intracellular parasites that are accountable to cause fatal infections ranging from small cutaneous lesions to more severe and life threatening visceral infections in humans (Holzmuller et al. 2002). Leishmaniasis is a neglected tropical disease that remains to impose a serious public health problems in many countries around the world because of absence of vaccines and limited availability of chemotherapeutic agents in use as of today (Rastrojo et al. 2018). The disease is widespread in tropical and subtropical countries and known to have both zoonotic and human to human transmission patterns with two main clinical forms that include cutaneous and visceral Leishmaniasis (Steverding 2017).
The main etiologic agent of cutaneous leishmaniasis is Leishmania major. It is commonly associated with lesions in the area where sand y bite occurs. Whereas, the cause of visceral leishmaniasis is Leishmania donovani which eventually spreads to internal organs culminating in death (Downing et al. 2011). The transmission of Leshmania donovani, a parasite that kills millions of people in east Africa in general and thousands in Ethiopia in particular was thought to be anthroponotic. However, recent genetic studies have claimed that there are possibilities of zoonotic transmission (Kassahun et al. 2015).
Leishmania donovani is an obligate intracellular parasite (Mukherjee, Chakraborty, and Chakrabarti 2016;Singh et al. 2016; Verma, Rastogi, and Mukhopadhyay 2017) that is postulated to inhabit phagolysosomes of the macrophages (Singh et al. 2016). Once the macrophage is inhabited the growth of the parasite will be favored with the help of phagosome mutation regulation which will eventually avoid destruction. As the result, antigen presentation, oxidative damage, apoptosis and immune activation will be depleted and the availability of the nutrients will be favored on the contrary (Podinovskaia and Descoteaux 2015). The parasite almost exclusively multiplies in the phagolysosomes of the infected macrophage and this affects the expression of gen in both the parasite itself and the host cell (Buates and Matlashewski 2001).
Leishmania donovani appears to have two distinct genomic features in uencing its genetics; where there are different copy numbers within an individual chromosome or where chromosome dosage varies among clonal population of a cell. On the other hand, once infection occurs it develops a mechanism that weakens microbicidal activity of macrophages interfering with the expression of Major Histocompatibility-II (MHC II) molecules after subsequent stimulation with interferon-γ (IFN-γ) (Buates and Matlashewski 2001). The parasites utilize various proteins and glycoproteins to interact with host cells establishing a long term infection with a clinical presentation of visceral Leishmaniasis (Mukherjee, Chakraborty, and Chakrabarti 2016).
Several complications may result from visceral Leishmaniasis that include hepatic cirrhosis, necrotizing oral infection, pulmonary tuberculosis, ocular complications, disseminated intravascular coagulations, nephritic syndrome, glomerulonephritis, neutropenia (Meyers et al. 2016), immunosuppression and pancytopenia (Mcgwire and Satoskar 2013) leading to concurrent or inter-current infections nally culminating in death (Meyers et al. 2016). There are various types of therapies to treat leishmaniasis that include rst line and second line regimens. The use of the drugs available may rely on regional guidelines and varies among several localities. Despite availability and accessibility of the drugs, ineffectiveness of treatment is increasing due to parasite resistance to the available drugs (Mcgwire and Satoskar 2013). Changes in parasite adaptation to different environment, development of silent genes and parasite progression through its lifecycle may be determined by transcriptional organizations where regulatory elements are located; i.e. the Transcriptional Start Sites (TSS) (Duncan et al. 2011). Leishmania donovani mainly loses its virulence through transcriptional and translational regulations (Sinha et al. 2018). Therefore, understanding the regulatory elements such as Transcription Factors (TFs), Transcription Factor Binding Sites (TFBSs) and CpG islands in Leishmania donovani genome that determines its gene expression is helpful to fully characterize its molecular mechanisms and design effective drugs, vaccines and other control strategies. Forty functionally putative genes of Leishmania donovani BPK282A1 genome were obtained from National Center for Biotechnology Information (NCBI) genome database (National Center for Biotechnology Information n.d.) on Nov 3, 2020 and analyzed using online regulatory elements determination tools. The corresponding TSS of each gene was analyzed using an online Neural Network Promoter Prediction (NNPP version 2.2) (BDGP: Neural Network Promoter Prediction n.d.), a tool that helps to nd promoters in a Deoxyribonucleic Acid (DNA) sequences of eukaryotic and prokaryotic cells.
Herein, at least 1kb upstream of the start codon for each gene was excised as it was done by Dinku and colleague (Dinka and Milkesa 2020). The excised segments were tted to NNPP version 2.2 by setting the minimum standard predictive score (between 0 and 1) with a cut off value of 0.8 (Kanhere and Bansal 2005). In conditions where there were two or more TSSs, the one having the highest prediction value was considered as statistically signi cant (Michaloski et al. 2011).

Motifs and transcription factors identi cation
The identi ed promoter sequences were transferred to another online resource, Multiple Expression motifs for Motif Elicitation (MEME) suit version 5.3.0 (Introduction -MEME Suite n.d.); and analyzed for common candidate motifs (Release Notes -MEME Suite n.d.). Motif is an approximate sequence pattern that occurs repeatedly in a group of related sequences. MEME discovers novel, un-gapped motifs (recurring, xed-length patterns) in the identi ed sequences being submitted. It is a motif-based sequence analysis tool that splits variable-length patterns into two or more separate motifs. It represents motifs as position dependent letter probability matrices that describe the probability of each possible letter at each position in the pattern. Individual MEME motifs do not contain gaps. Patterns with variable-length gaps are split by MEME into two or more separate motifs. This tool also takes as input a group of sequences and outputs as many motifs as requested. MEME uses statistical modeling techniques to automatically choose the best width, number of occurrences and description for each motif (Bailey and Elkan 1994). The identi ed motifs were exported to another web-based tool, TOMTOM (Tomtom -Submission form n.d.); which is used to compare one or more query motifs against one or more databases of target motifs (and their reverse complements when applicable), and reports for each query a list of target motifs, ranked by p-value. The E-value and the q-value of each match is also reported. The q-value is the minimal false discovery rate at which the observed similarity would be deemed signi cant (Gupta et al. 2007).

Gene ontology
The identi ed motifs were also tted to another motif-based sequence analysis tool, Gene Ontology for Motif (GOMo); under MEME suite version 5.3.0 to scan all promoters using nucleotide motifs provided to determine if any motif is signi cantly associated with genes linked to one or more Genome Ontology (GO) terms. The signi cant GO terms can suggest the biological roles of the motifs. The program searches in a set of ranked genes for enriched GO terms associated with high-ranking genes. The genes can be ranked, for example, by applying a motif scoring algorithms on their upstream sequence (Buske et al. 2010a).

Identi cation of CpG islands
A server based online CpG islands prediction analysis for genome sequence exploration was used to search for CpG islands (Database of CpG Islands n.d.). Genomic islands play an important role in medical, methylation and biological studies. CpG islands are springs of characteristically unmethylated DNA with a high content of the Cytosine (C) and Guanine (G); i.e., high GC content compared to the bulk DNA. If genes are expressed, the CpG islands of promoters are unmethylated whereas when methylation of these sites occurs it is essential in expression of genes. Majority (40%), of the CpG islands are located in the promoter region. Herein, the CpGPAP platform employs the use of Takai and Jones algorithm. In this algorithm, a DNA sequence with a length of >500bp, GC content >55% in that region and Observed/Expected (O/E) ratio of >0.65 was considered. The algorithm can successfully exclude false positives from short repetitive sequences (Chuang et al. 2012). Additionally, CLC Genomics Workbench ver. 5.5.2 (CLC Genomics Workbench n.d.) was used to search for the restriction enzyme MspI cutting sites with a fragment selection of 40 to 220pbs having in mind that the region enriches CpG islands.

Number of TSSs and predictive values of L. donovani BPK282A1 genome
The TSS of the parasite was predicted among forty putative genes of L. donovani BPK282A1 genome and the important variables such as gene identi cation number, corresponding promoter region, predictive value and location of the TSS were summarized in table 1. This in silico analysis of L. donovani BPK282A1 genome showed that a single sequence can have as low as one TSS and as high as four TSSs. Where there were two or more TSSs the one with highest predictive score was considered the best TSS location of the gene. In all cases, the TSSs have a predictive score that ranged from 0.80 to 1.00. Majority (82.3%), of the TSSs identi ed were located below -500bp away from the start codon of the gene being expressed. Surprisingly, the TSS of one gene was located at -1338 which is very apart from the start codon. The number of TSSs were signi cantly varied between 1 and 4 among the promoter region sequences. The analysis revealed that about forty percent (40%) of the gene had only one TSS while only one gene had four TSSs. On the other hand, the location of the TSS of one sequence (Pro-LDBPK_060-590) was remarkably close to the start codon which was only 2bps away upstream of the respective gene. 3.2 Promoter regions with potential motifs in L. donovani BPK282A1 The spatial distribution and location of the motifs were presented below in gure 1. The motifs were found on both positive and negative strands. The shortest distance of the location of the motif was only 2bps upstream of the TSS while the longest distance was 4941bps upstream of the TSS of the respective gene. Majority, 188 (72.3%) of the motifs were highly distributed on positive strand than the negative strand.
On the other hand, the study also revealed that in about twenty-seven (67.5%) sequences the location of the rst motif was held in the rst -500bps upstream of the respective gene. Furthermore, the promoter sequence of gene LDBPK_151500 was exceptionally too long and contains seven motifs among which only one motif was located on the negative strand.
The study has also revealed seven candidate motifs in the promoter regions being analyzed using serverbased MEME suite motif identi er. The analysis was performed at a cutoff value of 0.80 and provided MI, MII, MIII, MIV, MV, MVI and MVII motifs. The minimum site per motif was two while the maximum site per motif was forty. The identi ed motifs were presented below in table 2 with their respective total number of binding sites and the level of statistical signi cance (E-values). It was shown that MI was the common promoter motif for all (100%) L. donovani BPK282A1 genes that regulates transcription and translation of the genes being studied. Statistically, it shares at least 85% of the promoters containing each one of the motifs with an E-value of 4.0e-116. The sequence Logo of the best motif identi ed was presented in gure 2. TOMTOM, a web-based application was used for further analysis of the similarities between the identi ed common motif and other similar motifs which are publicly available in known registered biological databases of eukaryotes. Accordingly, the web-based tool was used three different databases, namely; jolma2013, JASPAR2018_CORE_vertebrates_non-redundant and uniprobe_mouse to search for similarities and 843, 579 and 386 motifs were used from each of the databases respectively and only seven matches similar to the submitted MI motif were identi ed. Six of the identi ed matches were closely related to zinc nger proteins that bind to the GC-rich regions in DNA and regulate various cellular functions. The remaining one of the matches was found to be related to AP2/B3-like transcriptional factor family protein which is found in Hordeum vulgare (barley) and the gene functions in vernalization. The possible binding candidate motifs of MI, suggested species, e-value and gene functions were presented in table 3.

Gene ontology
Having been compared the common motif using publicly known databases in TOMTOM, another server based investigation using GOMo (Gene Ontology for Motifs) (Buske et al. 2010b; GOMo Results n.d.) was used to look for GO terms for promoter regions of L. donovani BPK282A1 gene and no signi cant GO term was associated with the common motif (MI) submitted (Figure 3).
3.4 Determination of the CpG islands of L. donovani BPK282A1 genome in promotor and gene body regions CpG islands, stretches of DNA that are 500-1500bp long with CG:GC ratio of more than 0.65% in promoter and gene body regions were determined using two algorithms. First, an online CpG Island nder (Database of CpG Islands n.d.) that depends on Takai and Jones rigorous criteria where parameters were set as ob/ex ratio of the CpG islands ≥0.65, GC content of at least 50% and minimal length of ≥500bps was used to identify possible CpG islands present, the number of CpG islands found and the percent of GC contents in each of the promoter and gene body regions (Takai and Jones 2002). Herein, the search revealed that thirty-one (31) promoter regions of the L. donovani BPK282A1 contain at least 1 CpG island. Whereas nine of the sequences did not contain any CpG islands (Table 4).
Due to technical limitations, Table 4 is only available as a download in the supplemental les section.
To examine the presence of CpG islands in the gene body regions using online CpG Island nder (Database of CpG Islands n.d.), genes with less than 500bps were excluded maintaining consistency depending on the predetermined criteria of Takai and Jones as discussed previously. Accordingly, three genes (LDBPK_151100, LDBPK_060590 and LDBPK_180690) were excluded from the search. Thus, from the included thirty-seven gene body regions in the search algorithm thirty-two of them had at least one CpG island and two genes (LDBPK_180360 and LDBPK_180270) had 100% CpG concentration. In ve of the gene body regions (LDBPK_212210, LDBPK_030760, LDBPK_131090, LDBPK_303160 and LDBPK_180790) no CpG island was found (Table 5). Another algorithm (CLC Genomics Workbench 5) involving the use of restriction enzyme MspI was employed to examine the fragments of CpG islands in both the promoter and gene body regions regardless of the existence of methyl group. Methylation protects DNA cleavage by endonucleases HpaII at internal C position of its recognition site CCGG sequence. MspI restriction enzyme is an isoschizomer of HpaII that can cleave DNA irrespective of the presence of methyl group at the internal C residue positions (Waalwijk and Flavell 1978).

Discussion
Reliable identi cation of the TSSs is a prerequisite to realize the regulation and expression of genes in any organism in a promoter sequence (Halees, Leyfer, and Weng 2003). As stated by Dinka and colleague, genes may have one or more TSSs contributing to genetic variation of an organism (Dinka and Le 2018) which is in agreement with this study where the analysis revealed about forty (40%) percent of the genes had one TSS and sixty percent of them had two or more TSSs. They also showed that the locations for 80% of the TSS were within -518bp relative to the translation start codon which is also similar to 82.3% in this study where the location of TSSs identi ed were located below -500bp away from the start codon of the gene being expressed.
In eukaryotes, the occurrence of transcription regulation within 25bps upstream of the TSS were associated with core promoter while it was associated with promotor proximal elements anywhere within the range of 50 to 200bps upstream of the TSS. On the other hand, transcription regulation in either direction and orientation distant from transcription initiation sites were associated with distal enhancer promoter (Nikolov and Burley 1997). In this study, the analysis showed that the location of the best TSS for one promoter sequences (Pro-LDBPK_060-590) was at -2bps from the start codon where transcription is associated with core promoter. Whereas, the location of best TSSs in seven promoter sequences were associated with promotor proximal elements. The other 32 promoter sequences had the best TSS location of greater than -200pbs away from start codon and therefore may be associated with distal promoter enhancer elements.
Searching for motifs is not an insigni cant task and seems bottleneck where efforts are required to large scale projects that attempt to exactly identify the binding positions of TFs and to understand mathematical models employed for TFs (Reid et al. 2010). The most commonly used mathematical model is position matrix weight (PMW) model that can be used to understand the position dependent frequency or probability of each nucleotide in a motif (Boeva 2016). With this regard, the information contents of the most conserved nucleotide for the seven motifs identi ed were 28.9, 26.4, 23.8, 17.2, 22.3, 18.3 and 19.7 for MVI, MVII, MVIII, MVIV, MVV, MVVI and MVVII respectively as shown by sequence logo where the total height of each bin was considered the information content in bits of the corresponding position. Searching the binding sites in long promoter regions is likely to yield false prediction of the binding sites while it can also help to exceptionally provide true predictions (Boeva 2016). In this study, 20% of the promoter regions included in the analysis had a nucleotide length of greater than 2000bps.
Thus, additional analysis using different models is required whether the predicted values for those promoter regions were true or false.
Razavi et al. in their study on cloning and expression of leishmanolysin gene from Leishmania major in primate cell lines found that L. guyanensis and L. major possess zinc binding motifs which is essential for attachment of the parasite to the host cell and internalization (Razavi et al. 2001). Accordingly, from the seven matches that were identi ed in web based TOMTOM search in this study, six of them were associated with zinc nger protein binding TF in Homo sapiens and Mus musculus. Hence, there is possible functional similarities between the identi ed motif of L. donovani BPK282A1 gene and motifs of the L. guyanensis and L. major being reviewed. However, cloning and expression studies are required to establish scienti cally plausible similarities.
Up to 20% of eukaryotic genes and promoters contain tandem repeats. Tandem repeats are sequences that are highly unstable and subject to increased mutation rates which is usually 10 to 100,000times compared to mutations in other parts of a gene (Liang et al. 2012;Tsang 2014). The occurrence of the tandem repeats in a motif sequence is responsible for evolution of certain traits (Liang et al. 2012). With this regard, the identi ed best motif sequence logo of the L. donovani BPK282A1 gene in this study seems to have a tandem repeat sequence which probably contributes to the emergence of drug resistant strain of L. donovani as mutations in DNA binding motifs are responsible to changes in gene expression and function. Furthermore, it was also known that Leishmania donovani inhabits phagolysosomes of the macrophages (Singh et al. 2016) where it undergoes phagosome mutation regulation which will eventually avoid its destruction. As the result, antigen presentation, oxidative damage, apoptosis and immune activation are depleted and the availability of the nutrients are favored on the contrary (Podinovskaia and Descoteaux 2015). The reason for such escape mechanism may be due to the presence of the tandem repeats in the protein binding regions of the promoter sequences as discussed earlier. However, a thorough study that includes wet laboratory and bioinformatic tool must be conducted to understand in which stage of the parasite mutation typically undergoes.
Determination and identi cation of CpG islands is one of the integral part of biological, methylation and medical studies (Chuang et al. 2012). In vertebrates, 50% of the genes contain CpG islands (also called gene markers) in their promoter region. In spite of the fact that CpG islands are unmethylated in promoter regions, 80% of mammalian genome are methylated. As mutation rate in methylated CpG is higher than that of the unmethylated CpG, there is greater chance of loss of CpG islands which subsequently results in change of gene expression and function (Han and Zhao 2008). As previously discovered by Williams et al promoters with high CpG content are largely hypomethylated (Williams, Christensen, and Helin 2012). Das et al in 2014 discovered that L. donovani genes are highly enriched for CpG islands. In addition, they also revealed that CpG motifs delayed programmed cell death in macrophages which may be helpful to extend the parasite lifespan conferring the advantage of infection progression (Das et al. 2015). In line with this, the current Insilco analysis also discovered that L. donovani BPK282A1 genome is rich in CpG in both the promoter and gene body regions. On the other hand, It was also explained that antiapoptotic effect of L. donovani was observed where there was DNA methylation (Das et al. 2015).
The GC coverage of a gene region can be affected by its GC concentration. Hence, regions with 50 to 60% GC content do have the highest coverage compared to gene regions with GC content of 70-80% and gene regions with GC content of 30 to 40% (Amr and Funke 2015). With this regard, this Insilco analysis showed that majority (57.5%) of the gene promoters and close to half (47.5%) of the gene body regions do have the highest GC coverage. On the other hand, this study also indicated a great GC content variation that ranged from 50% to 100% (in two genes) which agrees with a previous literature indicating GC content in a eukaryotic organisms is highly variable and may be from as low as 30% to as high as 90% (Sharp 2001).

Conclusion
This Insilco analysis has unraveled great variations in the number of TSSs in promoter region and percent GC concentration in both promoter and gene body sequences. The study also discovered CpG rich sequences and the common motifs shared by Leishmania donovani BPK282A1 genes and the possible transcription factors associated with the sequence. Hence, the investigation unfolded gene expression regulation of Leishmania donovani BPK282A1 genes bringing new insight and contributing to understand mechanisms that underly its adaptation to different environments and development of robust drug and vaccine that targets the regulatory regions.

Declarations
Data availability All the data were obtained from NCBI and available online.

Competing interest
All authors do not have any con ict of interest and have seen and read the nal version of this manuscript.
Authors' contribution KJ conceived the idea and developed the proposal. He also obtained the data from NCBI and performed all the statistical analysis using different software and wrote the initial and nal version of the paper. MK supervised the overall work and gave advise on how to write the manuscript. AM helped in editing and providing technical support on how to use online tools. Table4.docx