Genome-Wide DNA Methylome and Whole-Transcriptome Landscapes of Spontaneous Intraductal Papilloma in Tree Shrews (Tupaia Belangeri)

Background Breast intraductal papilloma (IP) is mainly caused by the abnormal proliferation of ductal epithelial cells. Tree shrews are a potential animal model for studying breast tumours. However, little is known about the transcriptome and DNA methylome landscapes of breast IP in tree shrews. In this work, we performed whole-genome DNA methylation and transcriptome analyses of breast IPs and normal mammary glands in tree shrews. Results DNA methylation proles with a single-base resolution were generated via whole-genome bisulphate sequencing (WGBS) in both the IP and Control groups of tree shrews. This provided a genome-wide perspective regarding the epigenetic regulation of protein-coding genes in breast IP in tree shrews. The methylation levels at CG sites were considerably higher than those at CHG and CHH sites, and methylation levels were highest in gene body regions. We identied 3486, 82 and 361 differentially methylated regions (DMRs) in CG, CHG, and CHH contexts, respectively, and 701 differentially methylated genes (DMGs) were found. Furthermore, transcriptomic analysis identied 62 differentially expressed genes (DEGs), 50 long noncoding RNAs (lncRNAs), and 32 circular RNAs (circRNAs) in IPs compared with normal mammary glands. Correlation analysis between the DNA methylation and transcriptome data showed that 25 DMGs were also DEGs, among which the expression levels of 9 genes were negatively correlated with methylation levels in gene body regions. Importantly, integrated analysis identied three genes, PDZK1, ATP2B4 and LCP1, that could be used as candidates for further studying breast IP in tree shrews. Overall, this research provides the comprehensive landscape of the transcriptome and DNA methylome of spontaneous IP in tree shrews and highlights candidate genes for eliciting tumorigenesis. These results contribute to the application of tree shrews as an animal model of breast tumours. were involved in insulin secretion (P = 0.0001) and the oestrogen signalling pathway (P = 0.0002) in the endocrine system. Thus, we integrated the gene expression and DNA methylation maps and identied protein-coding genes with underlying changes related to DNA methylation in IPs; the resulting alterations probably induced the development of mammary tumours.


Background
Intraductal papilloma (IP) is a benign tumour found within breast ducts that accounts for approximately 10% of benign breast lesions [1]. The abnormal proliferation of ductal epithelial cells causes such growth, which occurs most commonly in women between 35-55 years of age [2]. Hormones, fertility, and diet are predisposing risk factors that may lead to the development of IPs [3]. Because IP is related to atypia, ductal carcinoma in situ (DCIS) and carcinoma, it is classi ed as a high-risk precursor lesion with a 6.3% risk of being malignant [3] and may be upgraded to atypical ductal hyperplasia or DCIS upon surgical excision [4]. However, the mechanism of breast neoplasm is not fully understood, and multidimensional molecular data from IP patients has not been fully integrated in studies on this topic.
The results of DNA sequence research con rm that tree shrews are close relatives of primates [5]. Tree shrews have become an increasingly popular experimental animal model for a variety of human tumour diseases, including lung cancer [6], hepatocellular carcinoma [7] and glioblastoma [8]. The genome sequencing of Chinese tree shrews was accomplished in 2013 and has provided a useful resource for functional genomic studies [9]. A database based on the genome sequence data of tree shrews has been established [10]. Most importantly, in terms of morphology and structure, the mammary glands of tree shrews are similar to human glands [11]. Based on the above advantages, tree shrews are ideal experimental animals for studying the pathogenesis of mammary tumours. However, there are few studies on gene expression patterns and the underlying function of DNA methylation in the tumorigenesis of spontaneous IP in tree shrews as a novel breast tumour animal model. DNA methylation is one of the epigenetic changes that has been shown to play an important role in the pretranscription regulation and inhibition of gene expression in many mammalian genomes. The mapping of genome-wide DNA methylation is of great importance for understanding tumorigenesis [12]. This modi cation participates in many cancers, including thyroid cancer [13], non-small cell lung cancer [14], gastric cancer [15], etc. DNA methylation is involved in the development and progression of breast cancer [16]. Likewise, limited evidence has suggested that the aberrant methylation of cytosine residues is involved in the development of IPs [17]. Therefore, it is helpful to understand the tumorigenesis of papilloma from the perspective of epigenetic regulation by depicting the methylation pro le and identifying differentially methylated genes (DMGs).
It is generally believed that the abnormal reprogramming of the whole transcriptome, including genes, long noncoding RNAs (lncRNAs) and circular RNAs (circRNAs), is a crucial process in tumour occurrence and progression. Recently, RNA sequencing (RNA-seq) studies on breast cancer [18] have been conducted to provide a deeper understanding of the mechanisms involved, and research on the potential underlying molecular mechanisms that in uence the occurrence and development of breast cancer has been performed in murine mammary tumour models [19]. By analysing methylome and transcriptome variations related to the survival status of breast cancer, we can further understand the basic biological process of breast cancer based on the genetic aetiology [20]. Moreover, the results of the analysis of DNA methylation and gene expression have demonstrated that the methylation level of CpGs in breast cancer tissues is signi cantly higher than that in adjacent normal tissues. Additionally, large numbers of CpGs show a signi cantly higher methylation level than that found in nearby normal tissues, which is negatively correlated with gene expression [21]. Thus, combined with methylation data and gene expression pro le data, we can better analyse the regulatory function of methylation to solve existing problems.
However, compared to studies in malignant breast cancer, studies on DNA methylation and the transcriptome in IPs are lagging behind. Therefore, we carried out an integrated analysis of genome-wide DNA methylation levels and the whole transcriptome in breast IP in tree shrews in the present work. Our study provides new insights into IP in tree shrews, highlights candidate tumorigenesis-eliciting genes, and will contribute to the application of tree shrews as breast tumour animal models.

Results
In the basic pathology of tree shrew breast tumours, the ductal epithelium of the breast showed papillary hyperplasia, the nipple varied in size, and the cells showed no atypia, similar to human pathology. The tumours are expected to be benign (Fig. 1A). In contrast, the normal mammary glands showed a normal structure of the breast acini and ducts, and there was no degeneration, necrosis, or in ammatory cell in ltration (Fig. 1B). As shown in Fig. 1, we con rmed that 3 of the spontaneous mammary tumours collected from females in the closed colony of tree shrews were breast IPs. The maximum size of the tree shrew spontaneous breast IPs was 5 cm x 4 cm. As shown in Table 1, the selected tree shrews were divided into two groups. The IP group (n = 3) consisted of tree  shrews with IPs (IP-1, IP-2 and IP-3), whereas the Control group (n = 3) consisted of tree shrews with healthy mammary gland tissues (Control-1, Control-2 and Control-3). These results indicated that the selected IP tree shrews and normal tree shrews were appropriate for subsequent analyses. in the IP group. Thus, the proportions of these contexts were analogous between groups ( Fig. 2A).
Epigenetic variation among DNA sequences, particularly CpG DNA methylation, is an important type of variation that modulates gene expression under physiological and pathological conditions. Our results showed that CG methylation levels were higher while CHG and CHH methylation levels were lower and that the number of mCG sites was increased in the IP group compared with the Control mammary gland tissues, while other types of sites did not differ The obtained DNA methylation pro les demonstrated that the methylation level in the CG context was higher than those in the CHG and CHH contexts. The DNA methylation level in the CG context was highest in the gene body region. DNA methylation was moderately high in the upstream 2K start site, decreased dramatically from the upstream 2 k region to the TSS, increased sharply from the TSS to the gene body region, maintained the highest level in the gene body region and then decreased slightly in the downstream 2K region (Fig. 2B).
We identi ed 3,486 differentially methylated CG regions, 82 CHG regions, and 361 CHH regions. In the CG context, 3,486 differentially methylated regions (DMRs), located in 701 genes, were identi ed between the IP and Control groups (Q < 0.05), among which 705 showed increased methylation and 2,781 showed decreased methylation in IP tissues compared with the levels in Control tissues (Additional le 1: Table S1). All DMRs were used for the comparison of differences, as shown in Fig. 3A. We analysed CG methylation sites in tree shrews in the IP and Control groups using hierarchical clustering. The results revealed separation between the two groups.
CG site methylation levels for the total methylated sites and DMGs in IP tumours (69.47%) were decreased compared with those in mammary glands (74.18%). Among the genes containing DMRs, 607 genes were located within the gene body, including 259 upregulated genes and 357 downregulated genes.
A total of 58 genes were located in the upstream 2K region, whereas 55 were located in the downstream 2K region (Fig. 3B) (Additional le 1: Table S1).
To further elucidate the biological functions of DMGs, we performed Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analyses. Based on GO term analysis, the putative target genes of KLF5 were associated with terms such as developmental process (P adjust = 0.0056) and single-organism developmental process (P adjust = 0.0103) in the biological process (BP) category; binding (P = 0.0063) in the molecular function (MF) category; and membrane (P = 0.0045) in the cellular component (CC) category ( Fig. 3C) (Additional le 1: Table S1). KEGG pathway analysis indicated that 15 DMGs were associated with the oxytocin signalling pathway (Q = 0.0027), while 10 DMGs were associated with the oestrogen signalling pathway (Q = 0.0113) ( Fig. 3D) (Additional le 1: Table S1). Taken together, the gene annotation results revealed that the majority of differentially genes showing increased methylation and decreased methylation were mapped to the gene body region, suggesting that DMGs play crucial roles in IP.
Differentially expressed genes (DEGs) between the IP group and normal mammary gland tissue To systematically describe the transcriptome landscape of the IP group and normal mammary gland tissues, whole-transcriptome sequencing was performed. After removing low-quality reads from each library, the clean reads were combined and aligned with the Tupaia chinensis genome (TupChi_1.0), resulting in the identi cation of 10,051 known mRNAs, 25,481 new mRNAs, 1,022 known lncRNAs, 1,617 new lncRNAs, and 10,602 new circRNAs in IPs (IP group) and normal mammary glands (Control group) in total. Furthermore, we compared the transcriptomic landscapes of the normal mammary gland and IP tumour tissues of tree shrews. We identi ed 39 upregulated DEGs and 23 downregulated DEGs in IP tumours compared with normal mammary glands (FDR < 0.05 & |log2FC|>1) (Fig. 4A). Among these DEGs, the expression of ST14, PSMF1, and TNFSF1 was more than 15-fold higher in the IP group. In normal mammary gland tissue, the levels of FAM192A and Psmc5 were 15-and 7-fold higher, respectively. More detailed information is listed in Additional le 1: Table S2.
Among these DEGs, GO analysis indicated that TNFSF11 participates in the tumour necrosis factor-mediated signalling pathway and the cellular response to tumour necrosis factor (P = 0.0042); GATA3, EPHA2, and PTPRG are involved in the regulation of epithelial cell migration (P = 0.0042); STAG2, TEX14, PPP1R9B, and TPR are involved in the regulation of cell cycle processes (P = 0.0087); and TNFSF11, and EPHA2 are involved in epithelial cell proliferation (P = 0.0091) (Additional le 1: Table S2). ACSL1 and APOA5 participate in the PPAR signalling pathway (P = 0.0367) according to KEGG analysis (Additional le 1: Table S2). Furthermore, we utilized gene set enrichment analysis (GSEA) to analyse groups of functionally relevant genes in IPs compared to normal mammary glands. A total of 107 GO terms and 66 pathways (FDR < 0.05) were indicated to be signi cantly enriched in dysregulated genes in IPs (Additional le 1: Table S2). The upregulation of cell cycle phase-and S phase-related terms in the biological phase category and mitotic sister chromatid segregation and sister chromatid segregation in the cellular component organization or biogenesis category was observed under the BP category in the IP group (Additional le 1: Table S2). Among the DEGs identi ed in the IP group, the upregulated gene sets were involved not only in replication and repair, including DNA replication, mismatch repair and nucleotide excision repair, but also in cell growth and death, including the cell cycle; the downregulated gene sets were involved in the endocrine system, including the renin secretion and PPAR signalling pathways (Additional le 1: Table S2). Overall, many of the DEGs identi ed in IPs were found to be involved in tumorigenesis, as demonstrated by the enrichment analysis of GO terms and pathways.
The lncRNA expression pro le is distinct between IPs and normal mammary gland tissue in tree shrew Compared with the results in normal mammary gland tissue samples, 50 differentially expressed lncRNAs (DElncRNAs), including 39 upregulated and 11 downregulated lncRNAs, were identi ed in the IP tissue samples (Fig. 5A). The most signi cantly upregulated lncRNAs were TCONS_00002926, TCONS_00077528, XR_001369927.1, TCONS_00037489, and XR_333956.1, whereas TCONS_00016941, XR_334171.1, TCONS_00135842, TCONS_00077456 and TCONS_00094673 were downregulated lncRNAs identi ed in the IP group (Additional le 1: Table S3). To reveal the function of the lncRNAs, we predicted the complementary correlation of antisense lncRNAs and mRNAs. All antisense target genes of the DElncRNAs were subjected to GO and KEGG pathway analysis to determine their functions. A variety of relevant GO terms in the BP category were observed (Additional le 1: Table S3), such as blood vessel endothelial cell migration (P = 3.99E-05), epithelial cell migration (P = 0.0012), epithelium migration (P = 0.0012), and negative regulation of apoptotic process (P = 0.0016). KEGG pathway analysis revealed that the Jak-STAT signalling pathway (P = 0.0087), PPAR signalling pathway (P = 0.0233), and oestrogen signalling pathway (P = 0.0292) were associated with DElncRNAs (Additional le 1: Table S3).
The second function of lncRNAs, when located less than 10 kb upstream/downstream from a gene, is to act as cis-regulators of their neighbouring genes on the same strand. A large number of enriched GO terms were observed in the BP category, including branching morphogenesis of an epithelial tube (P = 0.0001), serine phosphorylation of STAT protein (P = 0.0002), negative regulation of cell growth (P = 0.0017), and execution phase of apoptosis (P = 0.0019) (Additional le 1: Table S3). KEGG pathway analysis revealed that DElncRNAs were associated with the pathways of protein processing in the endoplasmic reticulum (P = 0.0017), glyoxylate and dicarboxylate metabolism (P = 0.0018), antigen processing and presentation (P = 0.0250), and the NOD-like receptor signalling pathway (P = 0.0483) (Additional le 1: Table S3).
The third function of lncRNAs is the trans-regulation of non-adjoining co-expressed genes. We analysed the correlation of expression between lncRNAs and mRNAs to reveal the target genes of lncRNAs and performed GO function and KEGG pathway enrichment analyses of protein-coding genes with an absolute correlation greater than 0.9 (Additional le 1: Table S3). The enriched GO functions of the DElncRNA trans-regulated co-expressed genes are listed in Table S3 and included mesenchymal stem cell differentiation (P = 0.0032), extrinsic apoptotic signalling pathway (P = 0.0070), recombinational repair (P = 0.0086), and the execution phase of apoptosis (P = 0.0125) in the BP category. The DElncRNA trans-regulated co-expressed genes were signi cantly enriched in KEGG pathways including homologous recombination (P = 0.0004), histidine metabolism (P = 0.0022), and the renin-angiotensin system (P = 0.0244) (Additional le 1: Table S3). These dysregulated lncRNAs might be involved in IP tumorigenesis in tree shrews by regulating genes and signalling pathways implicated in tumorigenesis.
The gene of origin of a circRNA is its parental gene. We carried out a functional enrichment analysis of parental genes to investigate the putative functions of differentially expressed circRNAs. GO analyses revealed that the parental genes of the dysregulated circRNAs were enriched in the terms branch elongation of an epithelium (P = 0.0161), negative regulation of cell proliferation (P = 0.0173), positive regulation of epithelial cell proliferation (P = 0.0233), vasculogenesis (P = 0.0276), and gland morphogenesis (P = 0.0290) in the BP category. The PI3K-Akt signalling pathway (P = 0.0139) and Ras signalling pathway (P = 0.0509) were identi ed in the KEGG enrichment analysis (Fig. 6B. 6C, Additional le 1: Table S4). Taken together, these ndings suggest that these DEcircRNAs that in uence gene expression may affect IP tumour development.

Integration analysis of the methylome and transcriptome
To examine whether differences in DNA methylation between the IP and healthy mammary gland tissues of tree shrews could be the basis of the observed gene expression differences, we analysed the correlation between the gene expression and DNA methylation data in the IP group against that in the normal mammary gland group, and the results indicated a considerable regulatory effect of DNA methylation on the modulation of gene expression. The methylation levels in the upstream 2K (Pearson's R = -0.0383 for IP group; and Pearson's R = -0.0192 for Control group), gene body (Pearson's R = -0.0922 for IP group; and Pearson's R = -0.0737 for Control group) and downstream 2K regions in the Control group (Pearson's R = -0.0013 for Control group) were negatively correlated with expression levels, but those in the downstream 2K region were positively correlated in the IP group (Pearson's R = 0.0030 for IP group) (Fig. 7A. 7B).
Additionally, DMGs and DEGs were compared through integrated methylomic and transcriptomic analysis, which showed 25 differentially methylated and expressed genes according to both RNA-seq (P < 0.05) and WGBS (Q < 0.05) (Fig. 7B and Additional le 1: Table S5). Among these genes, the gene numbers exhibiting DMRs in the upstream 2K, gene body and downstream 2K regions were 1, 23 and 1, respectively. The GO analysis of DEGs and DMGs showed that 15 genes could show enrichment in 213 BP terms (P < 0.05) (Additional le 1: Table S5). Among these genes, ATP2B4 was involved in the positive regulation of transmembrane transport (P = 0.0008), calcium-mediated signalling (P = 0.0022), the negative regulation of catabolic processes (P = 0.0031), and the negative regulation of calcium-mediated signalling (P = 0.0061); PDZK1 was involved in the positive regulation of transmembrane transport (P = 0.0014), positive regulation of transport (P = 0.0031) and regulation of transmembrane transport (P = 0.0126); LCP1 was involved in the regulation of intracellular transport (P = 0.0034) and the regulation of cellular localization (P = 0.0151); and PDZK1 and LCP1 are involved in the regulation of transport (P = 0.0135). The KEGG analysis of DEGs and DMGs showed that 9 genes were enriched in 19 signalling pathways (P < 0.05) (Additional le 1: Table S5). Subsequent analysis identi ed 9 genes with an inverse relationship between their degree of DNA methylation and gene expression in gene body regions (Table 2), which were related to signal transduction and the endocrine system. Three differentially upregulated and downregulated genes (ADCY5, ATP2B4, and CREB5) were associated with signal transduction pathways, including the cAMP signalling pathway (P = 3.13E-06), cGMP-PKG signalling pathway (P = 6.76E-05), TNF signalling pathway (P = 0.0118), and AMPK signalling pathway (P = 0.0131). Furthermore, ADCY5 and CREB5 were involved in insulin secretion (P = 0.0001) and the oestrogen signalling pathway (P = 0.0002) in the endocrine system. Thus, we integrated the gene expression and DNA methylation maps and identi ed protein-coding genes with underlying changes related to DNA methylation in IPs; the resulting alterations probably induced the development of mammary tumours.

Discussion
Herein, we provide an expanded overview of DNA methylation levels and transcriptome characteristics in 3 tumour tissue samples of spontaneous breast IP and normal mammary gland tissues from tree shrews in which gene expression was analysed. Tree shrews are currently considered an animal model for studying mammary tumours, including both spontaneous and induced models. Kazuhiro Daino et al. used a microarray to obtain the DNA methylation and expression pro les of γ-ray-induced mammary carcinomas in rats [22]. We obtained the genomic DNA methylation pro les of normal mammary gland and IP tissues of tree shrews in this context. DNA methylation levels were downregulated in tree shrews with IP compared with healthy tree shrews. In addition, GATAbinding protein 3 (GATA3) was upregulated 12.2-fold in IP tissues compared with normal mammary gland tissues. However, no difference in DMRs was found.
The different results between these two studies may be due to the differences in the experimental animals and types of mammary tumours examined.
Shao et al identi ed 17 Krüppel-like factors from Chinese tree shrews. KLF5 encodes a member of the zinc nger protein KLF subfamily that acts a transcriptional activator by binding to a speci c recognition motif directly in the promoters of target genes to play roles in both promoting and suppressing cell proliferation. Tupaia belangeri (Tb) KLF5, similar to human Krüppel-like factor (hKLF) hKLF5, signi cantly promotes cell proliferation, playing a proproliferative and oncogenic role in breast cancer [23]. These ndings suggested that tree shrews may serve as alternative animal models for breast cancer related to KLF5 [24]. In this study, KLF family members were not included among the DEGs, but DNA methylation analysis showed that KLF5 presented signi cant upregulation among the identi ed DMRs.
To obtain deeper insights into the molecular mechanisms of tumorigenesis, we carried out a correlation analysis of the RNA-seq and WGBS data. The results suggested an inverse correlation of PDZK1, ATP2B4 and LCP1 gene expression with DNA methylation between the IP group and Control groups. PDZ domaincontaining 1 (PDZK1) encodes a PDZ domain-containing scaffolding protein [25]. Genome-wide association studies (GWASs) of a large cohort identi ed rs12405132 of PDZK1 as a new susceptibility locus for breast cancer; hence, PDZK1 is a potential interacting gene in breast cancer [26]. In primary breast cancers, PDZK1 is an oestrogen-regulated gene that is overexpressed in ER-positive breast cancers [27]. PDZK1 has been as a marker of oestrogen-regulated gene (ERG) expression in examining the relationship between the menstrual cycle and oestrogen receptor-positive breast cancer [28]. Dunbier et al. further veri ed that the expression of PDZK1 was strongly related to plasma oestradiol (E2) levels in postmenopausal patients with primary ER-positive breast cancer [29]. PDZK1 exhibits epithelial expression with a primarily cytosolic subcellular localization, and PDZK1 expression is indirectly modulated by ER-α stimulation [30].
ATPase Plasma Membrane Ca 2+ Transporting 4 (ATP2B4) encodes plasma membrane calcium ATPase isoform 4 (PMCA4b), which belongs to the P-type primary ion transport ATPase family. The PMCA4b (ATP2B4) protein is located primarily in the plasma membrane, is expressed in normal breast tissue, and plays an important role in the plasma membrane Ca 2+ pump in the maintenance of mammary epithelial Ca 2+ homeostasis [31]. The PMCA4 protein is present in the normal breast ductal epithelium; however, a variety of factors, including hormonal imbalances, epigenetic modi cations and impaired protein tra cking, may lead to PMCA4b loss in breast cancer [32]. The same study showed that the regulation of Ca 2+ signalling by increased PMCA4b expression may be conducive to the normal development of the breast epithelium. Consistent with the results of this previous study, the expression of ATP2B4 mRNA was downregulated in our IP group by 7.3-fold, and methylation was upregulated. In breast cancer treatment, the targeting of PMCA4 may enhance the effectiveness of breast cancer therapies that act through the promotion of cell death pathways [33]. The targeted regulation of PMCA4 functionality may give rise to novel therapeutic methods to attenuate or facilitate new vessel formation in breast cancer, which is associated with angiogenesis [34].
Lymphocyte cytosolic protein 1 (LCP1) is an L-plastin protein-coding gene that is a member of the actin-binding protein family, which is conserved during eukaryote evolution and is expressed in the majority of tissues of higher eukaryotes. LCP1 plays a critical role in the activation of T-cells, associated with NF-kappaB signalling, calcium ion binding and actin binding. In addition, the expression of L-plastin is induced concomitant with tumorigenesis in solid tissues. A negative effect of LCP1 on breast cancer progression has been proven, and the inhibition of LCP1 results in breast cancer cell migration, invasion, and proliferation [35]. Mutations in LCP1 have been reported as putative cancer drivers on the basis of whole-exome sequencing in independent benzo[a]pyrene (BaP)-derived post-stasis human mammary epithelial cell strains [36]. L-plastin is a protein that exerts a cell-protective effect against TNF cytotoxicity in breast cancer cell lines [37]. The actin-binding protein LCP1/L-PLASTIN has been veri ed to participate in CXCL12/CXCR4 signalling in breast cancer cells [38].
Therefore, we concluded that PDZK1, ATP2B4, and LCP1 might be key regulatory genes during the development of spontaneous IP in tree shrews. In addition, the DNA methylation of these genes may be a crucial functional regulator of tumorigenesis. Nevertheless, the epigenetic mechanisms participating in the modulation of these genes as well as genetic regions associated with the development of IP require further exploration.

Conclusions
Overall, our ndings systematically demonstrated the changes in mRNA, lncRNA, and circRNA and allowed the characterization of the genome-wide DNA methylation pro les of IPs and normal mammary glands in tree shrews, providing valuable evidence for better understanding the development of mammary tumours. Our research also showed that DNA methylation in uences the expression of genes associated with the development of spontaneous IP in tree shrews. Such analyses greatly improve the progress in exploring the characteristics of DNA methylation in the development of breast IP and provide new directions for the study of epigenetic markers and target genes for spontaneous mammary tumours.

Methods
Tissue specimens from tree shrews and their histology In total, six tree shrews were obtained from the Institute of Medical Biology, Chinese Academy of Medical Science (IMB-CAMS), in Kunming. Animal experiments were approved by the animal ethics committee of IMB-CAMS (animal ethics approval number: DWSP201809003).
Both normal breast tissues and breast tumours were isolated after animals were euthanasia with intraperitoneal injection of pentobarbital sodium (100 mg/kg). The mammary tumours were surgically removed, and the size of each tumour was then measured, and its volume was calculated as V = π/6 × (a) 2 (b), (where a and b represent the shortest and longest transverse diameters, respectively) [39]. A portion of the tissue samples was dissected, xed in a 4% paraformaldehyde solution, embedded in para n, and stained with HE before being histologically diagnosed by pathologists. Another portion of the tissue samples was immediately frozen in liquid nitrogen and stored at -80 °C for subsequent experiments.

Library construction and WGBS
Genomic DNA was extracted from the samples via the cetyltrimethylammonium bromide (CTAB) method, and the DNA concentration and integrity were by using a NanoPhotometer® spectrophotometer (IMPLEN, CA, USA) and agarose gel electrophoresis, respectively. Then, DNA libraries for bisulphite sequencing were prepared. Brie y, genomic DNA was fragmented into 100-300 bp fragments by sonication (Covaris, Massachusetts, USA) and puri ed with a MiniElute PCR Puri cation Kit (QIAGEN, MD, USA). The fragmented DNAs were end repaired, and a single "A" nucleotide was added to the 3' end of the blunt fragments.
Then, the genomic fragments were ligated to methylated sequencing adapters. Fragments with adapters were bisulphite converted using the Methylation-Gold kit (ZYMO, CA, USA), and unmethylated cytosines were converted to uracils through sodium bisulphite treatment. Finally, the converted DNA fragments were PCR ampli ed and sequenced using an Illumina HiSeqTM 4000 PE 150 instrument.

Methylation level analysis
After data ltering, the acquired clean reads were mapped to the Tupaia chinensis (Chinese tree shrew) reference genome (TupChi_1.0) (GCF_000334495.1) using BSMAP software (v2.90) [40]. Then, a custom Perl script was applied to call methylated cytosines, and a correction algorithm was applied to the methylated cytosine results [41]. Methylation levels were calculated according to the methylated cytosine percentage in the global genome as well as in variant regions of the genome for each sequence context (CG, CHG and CHH). To estimate variant methylation patterns in variant genomic regions, the methylation pro les of the anking 2 kb regions as well as the gene body were plotted based on the average methylation levels for each window.

DMR analysis
To investigate DMRs between two groups, the minimum read coverage to call the methylation status of a base was set to 4. The DMRs in each sequence context (CG, CHG and CHH) were identi ed according to the following criteria: for CG, CHG, CHH and all C, the number in each window had to be ≥ 5, 5, 15, and 20, respectively; the absolute value of the difference in the methylation ratio had to be ≥ 0.25, 0.25, 0.15 and 0.2; and q ≤ 0.05 was required for all. GO and KEGG pathway enrichment analyses were performed for DMR-related genes to explore the functional enrichment of genes in uenced by DMRs.
Strand-speci c library construction and whole-transcriptome sequencing First, total RNA was extracted using TRIzol, and rRNAs were removed to retain mRNAs and ncRNAs. The enriched mRNAs and ncRNAs were fragmented into short fragments by using fragmentation buffer and reverse transcribed into cDNA with random primers. Second-strand cDNA was synthesized by using DNA polymerase I, RNase H, dNTP (dUTP instead of dTTP) and buffer. Then, the cDNA fragments were puri ed with a QiaQuick PCR extraction kit, end repaired, subjected to poly (A) addition, and ligated to Illumina sequencing adapters. Next, UNG (uracil-N-glycosylase) was used to digest the second-strand cDNA.
Finally, the digested products were size selected by agarose gel electrophoresis, PCR ampli ed, and sequenced on the Illumina HiSeq TM 2000 platform.
Alignment with ribosomal RNA (rRNA) and the reference genome After the ltering of clean reads, Bowtie2 (v2.2.8) [42] was applied for mapping reads to the rRNA database, removing the rRNA mapped reads. The remaining reads were used for subsequent assembly and analysis. Moreover, the rRNA-removed reads were mapped to TupChi_1.0 (GCF_000334495.1) by TopHat2 (v2.1.1) [43]. After alignment with TupChi_1.0, unmapped reads (or mapped very poorly) were aligned with Bowtie2 again; the reads that mapped to the genomes were removed, and the unmapped reads were collected for circRNA identi cation. In addition to identifying expressed genes and their quantitative expression, sequence alignment is helpful for discovering new transcripts.

Transcript reconstruction
The reconstruction of transcripts was performed with Cu inks software [44], together with TopHat2, to identify new genes and new splice variants of known genes. The program reference annotation-based transcripts (RABT) was previously applied. Cu inks constructed faux reads in terms of reference to make up for the effect of low-coverage sequencing. Finally, all reassembled fragments were aligned with reference genes, and similar fragments were discarded.

Novel transcript identi cation and annotation
To identify new transcripts, all reconstructed transcripts were aligned to TupChi_1.0. We applied the following parameters to identify predictable novel genes: length of transcript > 200 bp and exon number > 2. To acquire functional annotations of novel transcripts, alignment to the Nr, KEGG, and GO databases was performed.

CircRNA identi cation and database annotation
From both ends of the unmapped reads, 20-mers were extracted and aligned to TupChi_1.0 to identify unique anchor positions within splice sites. Moreover, anchor reads that aligned in the reversed orientation (head-to tail) and showed circRNA splicing were subjected to nd_circ [45] analysis to identify circRNAs.
Anchor alignments were extended such that the complete read aligned and the breakpoints were anked by GU/AG splice sites. A candidate circRNA was called if it was supported by at least two unique back-spliced reads. Furthermore, the identi ed circRNAs were subjected to statistical analysis of their type and length distribution. Finally, circRNAs were subjected to BLAST searches against circBase [46] for annotation, and those that could not be annotated were de ned as novel circRNAs.
LncRNA prediction and analysis CNCI (v2) [47] and CPC [48]were applied to evaluate the protein-coding potential of novel transcripts according to default parameters. Novel transcripts were also mapped to the SwissProt database to obtain protein annotations. Those showing the intersection of neither protein-coding potential nor protein annotation results were chosen as lncRNAs. To investigate the interaction between antisense lncRNA and mRNA, RNAplex [49] was used to perform complementary correlation analysis. The program contains the ViennaRNA package [50] and predicts the best base pairing according to thermodynamic structure on the basis of the calculation of minimum free energy.
Quanti cation of transcript abundance and circRNA abundance Transcript abundance was quanti ed by RSEM [51]. The fragments per kilobase of transcript per million mapped reads (FPKM) method was used to normalize transcript expression levels. In addition, the reads per million mapped reads (RPM) method was used to scale back-spliced junction reads to quantify circRNAs.

Analysis of differentially expressed transcripts and differentially expressed circRNAs
Differentially expressed mRNAs, lncRNAs and circRNAs were identi ed. To determine differentially expressed transcripts and circRNAs between two groups, the edgeR package was used. In each comparison, we identi ed mRNAs with a fold change ≥ 2 and FDR < 0.05 as DEGs and circRNAs/lncRNAs with a fold change ≥ 2 and P value < 0.05 as differentially expressed circRNAs/lncRNAs. The relevant coding RNAs were subjected to GO function and KEGG pathway enrichment analysis.
Correlation of DNA methylation and gene expression including the ± 2 kb anking regions and gene body region. Spearman correlation analysis was used to statistically identify the relationships between gene expression and DNA methylation within the gene body and ± 2 kb anking regions. Rho < 0 indicates a negative correlation and Rho > 0 a positive correlation.
To investigate the underlying functions of DNA methylation responsible for differential gene expression, the common genes between the DMR-related genes and DEGs were analysed, and GO and KEGG pathway enrichment analyses were conducted for DEGs with DMRs.
GO and pathway enrichment analysis GO enrichment analysis recognizes the key biological functions of genes by providing all GO terms that are signi cantly enriched in genes compared to the genomic background, in addition to ltering genes that correspond to biological functions. Moreover, genes generally interact with each other to carry out speci c biological functions. Pathway-based analysis contributes to further identifying gene biological functions. KEGG is the primary public pathway-related database [52]. All genes were mapped to GO terms in the GO database, gene numbers were calculated for every term, and signi cantly enriched GO terms were identi ed. Pathway enrichment analysis revealed signal transduction or metabolic pathways that were signi cantly enriched in genes compared to the genome background. The calculated P values were subjected to FDR correction, taking a P value ≤ 0.05 as a threshold. GO terms/pathways meeting this criterion were de ned as signi cantly enriched GO terms/pathways in genes.

GSEA
We carried out gene set enrichment analysis using GSEA software [53] to discern whether a set of genes enriched in distinct GO terms/pathways showed signi cant differences between two groups. In brief, we input a gene expression matrix and ranked genes by using the SinaltoNoise normalization program.
Enrichment scores and p values were calculated using default parameters.