Transcriptome signature of the lactation process in a fat-tailed sheep identifies with integrative approach of RNA-Seq and Supervised Machine Learning models

The mammary gland is a vital organ in mammalian that undergoes a substantial physiological transformation during the lactation process. Understanding the underlying molecular mechanism of the lactation process in the mammary gland is essential to unravel of the complexity of milk production process. This study was investigated the transcriptome profiles of milk between two time point of lactation (day 20 as before peak time point (BF) and day 70 as after peak(AF)) in Iranian fat-tailed Ghezel sheep breed. Functional impacts of differentially expressed genes (DEGs) between two time points of the lactation were surveyed with Gene Ontology (GO) and Protein-Protein Interaction (PPI) network analysis. Moreover, to identify the transcriptome signature of the lactation process, supervised machine learning algorithms were integrated. of mainly enriched in metabolic Moreover, networks analysis the contribution More interestingly, the genes involved in metabolism nineteen attribute ranking algorithms, signature process in Ghezel sheep.

3 insight regarding the transcriptional responses for the lactation process.

Background
Milk is a white liquid food and nutrient-rich that is consumed by mammalian infants and secret by mammalian during the lactation period. The lactation stage can be divided into three phases; the early, mid and late lactation [1]. Human and cow milk [2] and sheep milk [3] have a significant difference in the physicochemical features. Also, there are many differences between each stage of lactation processes in different phases. Fat and protein contents of sheep milk are higher than goat and cow milk. Moreover, in comparison with cows, buffaloes, and goats milks, lactose content of sheep milk is higher [4].
Milk contains a different percentage of fat, protein, lactose, casein and whey protein in early, mid and late lactation in sheep [3], cow and human [5]. These sheep milk properties make it appropriate for dairy products such as yoghurt and cheese making.
The milk transcriptome has been characterized in human [6,7], bovine [8,9] and sheep [10]. In Spanish sheep breed the number of 573 DEGs introduced across lactation points (10, 50, 120 and 150 days), with the most considerable divergences being found, between day 10 and day 150 [10]. It has been also reported that immune and stress responsive genes are important pathways in the lactation process [10].
Transcriptome profiling technologies, such as cDNA library, microarray, and RNA-sequencing is versatile tools to get new insight of underlying mechanism of different biological functions and cellular activities [11][12][13][14][15][16]. To tarnscriptome level study of lactation process, mammary gland biopsies Among them, MFG transcriptome represents the RNA profiles from mammary epithelial cells (MEC) and mammary gland in lactation processes studies [7,9,17,18].
Sheep milk is mainly used for cheese production, so its quality directly influences cheese yield [19].
Ghezel sheep has originated from northwestern provinces of Iran (East and west Azerbaijan) and west of Turkey and is commonly tolerated to dry and cold circumstances [20]. Ghezel sheep's is commonly used for its dairy products characteristics, Lighvan cheese is a famous traditional 4 cheese made from Ghezel sheep milk in the village of Lighvan located in the northwest Iran, East Azerbaijan province [21]. So, it is an ideal model breed for studding the milk production underlying mechanisms.
In this study, RNA-Seq transcriptome profiling were integrated with Machine Learning modeling to obtain the signature of two distinct time point of lactation (before peak and after peak lactation) in Ghezel sheep.

Animals and sampling
Two healthy Ghezel sheep at the second lactation stage were involved in this study. These sheep were kept in free stall housing; feeding with the same rations and without water restrictions. Milk samples were collected with a hand at two lactation stages: the before peak stage (on day 20 after lambing) and after peak stage (on day 75 after lambing). Collected samples were stored in sterile and RNase-free 50-ml tubes for MFG extraction.

Collection of MGFs and RNA isolation
Samples were centrifuged at 2000 g for 10 min at four 0 C in order to isolate milk fat. The supernatant fat layer was transferred to a new RNase-free 15-ml tube using an RNase-free spoon. A 500-µl quantity of fat was put into a 15-ml tube, and 1.5 ml of Trizol reagent (Invitrogen Life Technology) was added to dissolve the fat layer. Total RNA from MFG was extracted by using a TRIzol LS solution (Invitrogen Life Technologies Inc.) following the manufacturer's protocol, with minor modifications.
Briefly, to remove lipids, centrifugation was performed at 12,000 × g for 10 min at 4°C. A 400-μL quantity of chloroform was then added to the remaining solution and the mix was shaken vigorously.
After centrifugation, the upper aqueous layer (approximately 60%), which were contained RNA, was carefully pipetted, and RNA was precipitated by adding 750 μL of isopropanol/1 mL of TRIzol LS used for extraction. After centrifugation (12,000 × g for 10 min at 4°C), RNA pellets were washed with 75% ethanol, dissolved in 30 μL of RNase-free water, and finally stored at -80°C until further processing.
Genomic DNA contaminations were removed by DNase I (Qiagen, Hilden, Germany). Then, the quality and quantity of isolated RNA were surveyed with agarose gel electrophoresis and Nano drop 5 spectrophotometry, respectively.

RNA sequencing and Data analysis
The RNA integrity value (RIN) of the samples selected to be sequenced ranged between 7 and 8.
Paired-end libraries fragmented into short fragments (200-700 bp) by using the fragmentation buffer.
The fragments were sequenced on an Illumina Hi-Seq 4000 sequencer, according to the manufacturer's instructions (BGI). For each library, 60 million paired-end 100 bp reads were generated during the sequencing run. The quality of raw data was evaluated using FastQC (Version

Gene Ontology enrichment of DEGs
To analyze the functions of DEGs, GO enrichment of DEGs was performed in three Biological processes, molecular function and cellular component categories [25]. The adj-value ≤ 0.05 was set as cutoff threshold for enrichment analysis.

Protein-Protein Network analysis
To better understanding of the dynamic process of a complex process such as lactation, information about the functions of genes/proteins is needed [15]. In this regard, the protein-protein interaction network of DEGs was firstly constructed using the STRING (https://string-db.org/) database.
Experimentally validated, text mining, databases, co-expression, and neighborhood interactions were considered for the network construction [25]. Then, to define the functional modules, constructed networks were clustered with the K means algorithm. K-means is a popular clustering algorithm which is widely used in anomaly-based intrusion detection. It tries to classify a given data set into k (a predefined number) categories [26].
Selection of transcriptome signature using the Attribute weighting algorithms To identify the transcriptome signature of the lactation process, attribute weighting algorithms were implied on genes with significant expression in two time point of lactation. First, the two-step standardization procedure was carried out on expression data as previously proposed [27][28][29][30]. Then, stage of lactation (AF and BF) and FPKM values of DEGs was set as label attribute target, respectively.
Ten attribute weighting algorithms including the Information Gain, Information Gain Ratio, Chi-Squared, Deviation, Rule, SVM, Gini Index, Uncertainty, Relief, and PCA were separately applied on the DEGs data set. The DEGs with weighting value higher than 0.7 were considered as signature [31-33].

Results
Summary statistics for the RNA-Seq data Milk samples of Ghezel sheep from two lactation time points including the D20 (BF period) and D75 of lactation (AF period) with two biological replicates were sequenced. A total of 243,063,312 paired-end reads, ~60 million reads per sample were obtained from transcriptome sequencing of milk samples (n=4). Alignment of the reads to the Ovis arise genome Oar_v4.1 was 73.25% at mean (Table 1). Table 1: General statistic of reads in pre and post-trimming and mapping results AP: After Peak, BP: Before Peak The mapping ratio for each samples were described in Table 2. According to the FPKM value, mapped genes were divided into the high, moderate and low expression for genes with 500> FPKM, 10<FPKM <500 and <10 FPKM, respectively. Genes with > 0.02 FPKM were considered as a significant expression [8]. Totally, 23148 and 22362 genes were defined as expressed genes in BF and AF, respectively.

Analysis of the some most expressed genes in each stage of lactation
The normalization index makes simplify the comparison of expressed genes within a sample [34], so FPKM index were used to determine the DEGs. Table 2 shows the number of genes with the highest FPKM (≥500) at BF and AF of lactation. Distribution of FPKM values were provided in Figure 1. It is apparent from figure 1 that ATP6, PAEP, CSN1S1, CSN2, LALBA, RPS29, CYTB and CSN1S2 genes have increased expression in progress of lactation. Whereas, decreased expression of ND1, ND6 and ZNF33B genes were shown in the AF in comparison with BF of lactation.

Differentially expressed genes between BF vs. AF
Differential expression analysis with edgeR package has defined the 75 genes as DEGs between two stages of lactation. Among these DEGs, 60 and 15 genes were up-and down-regulated, respectively

Gene ontology
To indicate possible functions of DEGs identified in the milk samples, the genes were classified into three main categories (biological process, molecular function, cellular component) according to the GO algorithm (Fig 3). In the biological process category, "oxidation-reduction process", "metabolic process", "cation transport", and "oxidative phosphorylation" dominantly enriched.

Gene network
Functional impacts of DEGs were also surveyed in system levels analysis with PPI network construction. The constructed network was consisting of 55 nodes (DEGs) and 182 edges which strength of interaction score set on > 0.4 (Fig 6). Disconnected nodes in the network were removed.
The generated network is clustered into three functional modules. The first module which highlighted with blue color, annotated in oxidative phosphorylation and metabolic pathways. The second module annotated as PPAR signaling pathway based on KEGG database.
Ribosomal pathway includes RPL and RPS gene family which they have an important role in mammary gland development [44,45] and also, reported that the growth of the mammary gland is a critical process in milk production [46].  Table 3.  The FPKM index distribution were applied for identify the DEGs between two time points of lactation.
Results of GO results, Oxidation-reduction (redox status) enriched in BP category, is the most important physicochemical features in quality and flavor determination of dairy products [50]. All of the genes involved in the "redox" process has up regulated in BF which milk fat is higher than the AF time point of lactation. Other term activates is "oxidative phosphorylation" process which contains a series of ATPases in the metabolic pathway to energy generation [51], and milk production is a process that needs high energy [52,53]. Less oxidative phosphorylation is a major mechanism contributing to the low milk production in mammalian. Some genes like COX and ND family genes which are involved in the oxidative phosphorylation pathway, have up-regulated in BF time point of lactation. It has been also demonstrated that redox status has a significant positive correlation with fat content in milk [54] which is in agreement with the results of this study. Oxidative phosphorylation process which main process for providing energy for milk production take place in mitochondrial [55].
In this study we used MFGs as a RNA source to the evaluation of milk profile gene expression. Milk fat global is the most important part of the milk and some milk fat globules secreted from the cytoplasm [56]. The enriched binding term in milk refers to binding some mineral and enzyme activity in the milk [57,58]. Results of GO analysis confirmed the functional role of the DEGs on milk production. The biological importance of "Redox" and "Oxidative phosphorylation" is in the quality of dairy product and energy generation for milk production.
Among the CN gene family, CSN2 gene showed the highest expression in Ghezel milk. It is in agreement with the caseins profiles which has been reported by [59]. Previous studies have been reported that during the lactation progress, expression of whey proteins and lactose synthesis in the Assaf and Churra sheep breed are not changed [8,10]. In contrast, results of our study showed that the CN and whey genes are down-regulated in BF time points. It has been indicated that the total milk protein from early to end lactation has increased rate [60]. It is may be due to cleaving of caseins and whey proteins are broken into bio-active peptides in the early lactation, so their concentration is not reflected in milk proteins profiles [61,62].
Based on the results of network analysis, "oxidative phosphorylation", is the main function for energy generation for milk production [55].PPAR pathway includes three subtypes ((PPARalpha, beta/delta, and gamma) which encoded by specific genes and bound fatty acids

Conclusion
To best of our knowledge, this is the first study that transcriptome of Ghezel sheep MFGs in two distinct lactation periods were profiled with RNA-Seq technology. Results of our analysis defined the 75 genes as responsive genes between two time points of lactation peak. Among them, caseins, whey proteins, lactose biosynthesis, and fat metabolism involved genes, have significantly up-regulated before lactation peak. Network analysis also highlighted the "oxidative phosphorylation", "metabolic pathways" and "PPAR signaling pathways" contribution in lactation process. Integration of data mining approaches with RNA-Seq technology proposed that the genes involved in energy generation,

Availability of data and materials
The datasets generated and analyzed during the current study are not publicly available due further analysis will be performed but are available from the corresponding author on reasonable request.
Please contact author for data requests.

Authors' contributions
All authors have read and approved the manuscript. MF: research concept and design, data analysis and interpretation, wrote the article, and final approval of the article. SR: wrote the article. EE and BP: data analysis and interpretation, critical revision of the article, and final approval of the article.      code is used to present the two time points studied, red color for AF (after peak) and green color for BP (before peak) of lactation period.

Ethics approval and consent to participate
26 Figure 2 Volcano plot of the expressed genes in the two before and after peak lactation groups. The red points indicate the differentially expressed genes. 27

Supplementary Files
This is a list of supplementary files associated with this preprint. Click to download.