Genome-wide identification and expression analysis of the Collagen family during lactation in water buffalo (Bubalus Bubalis)

Background: Collagens, as extracellular matrix molecules, support cells for structural integrity and a variety of other functions, thereby contribute to support mammary basic structure and development. However, little information on the identification and expression profiles in response to the mammary gland of the collagen family in buffalo (Bubalus Bubalis) has been reported. Results: A total of 128 buffalo collagen protein sequences corresponding to 45 collagen genes were identified and classified into six clusters based on their phylogenetic relationships, conserved motifs, and gene structure analyses. A transcription factor binding sites (TFBS) analysis inferred that a total of 142 TFBS were predicted within the buffalo collagens, suggesting that different collagen subfamilies harbored different TFBS and played a variety of functions involved in the mammary gland development and lactation. The identified collagen sequences were unequally distributed on 17 chromosomes, 103 of which were determined to be tandem duplicated genes. Transcriptome data and qRT-PCR analysis revealed the expression diversity of buffalo collagen genes in various tissues. Most of the identified collagen genes were significantly up-regulated at the early lactation, 6 collagens upregulated at the peak lactation, and only COL24A1 was up-regulated at the late lactation. Conclusions: The present study provides significant insights into the potential functions of collagen family in dairy buffalo and helps in the functional characterization for collagen genes in further research. four lactation points. The results showed that 29 out of 45 collagen genes were mainly up-regulated at the D7, six collagens (COL6A6, COL3A1, COL1A1, COL27A1, COL8A1, and COL14A1) upregulated at the D50, two collagen genes (COL4A3 and COL10A1) were mainly expressed

4 generated and utilized for identifying the candidate genes related to traits of interest in buffalo, such as the potential genes related to milk or productive traits [19,20], transcriptome profiles of buffalo embryos with normal and retarded growth [21], and maternally-expressed proteins in buffalo oocyte [22]. All those data, along with the complete buffalo genome sequence [23], provide the possibility to perform the gene family analysis at a genome-wide level. In plants, numerous gene families have been identified and characterized, such as the sugar transporter gene family [24], KUP family [25], and heat shock factor gene family [26], which contribute to the understanding of gene structure, evolutionary patterns, and expression profiles. In animals, Körbes et al. [27] reported the molecular evolution of the lysophosphatidic acid acyltransferase (LPAAT) gene family, suggesting that the evolution of triacylglycerol biosynthesis shaped by diversification of the LPAAT genes. However, little information on the expression profiles of the collagen family in different tissues and developmental mammary gland of buffalo is available. Because of the vital functions of collagen in animals, it is of considerable importance to investigate the buffalo collagen family. In the present study, we identified collagen family from the buffalo genome and analyzed their evolutionary relationship, sequence features, chromosomal location, gene duplication, tissue-specific expression levels and dynamic expression patterns in response to different lactation points in mammary gland tissues. Our results provide some insight into the understanding of buffalo collagen family and present vital evidence for future functional studies.

Identification and classification of the collagen family in buffalo
In the present study, an integrative analysis of BLAST and HMMER showed that a total of 128 putative collagen protein sequences were identified from the buffalo whole genome, belonging to the 45 buffalo collagen genes ( Table 1). The open reading frames (ORFs) of the collagen proteins ranged from 1,317 to 9,654 bp in length encoding the protein of 438 to 3,217 residues, with the predicted MW from 107.46 to 786.60 kDa. The pI values of these proteins ranged from 4.54 to 4.92. Furthermore, the phylogenetic analysis revealed that the identified collagen genes could be classified into six 5 clusters ( Figure 1). The top 3 clusters contained 8, 7, and 4 collagen genes, corresponding to the Cluster II, Cluster I, as well as Cluster IV and VI, respectively. The smallest group is the Cluster V with only two collagen members (collagens XV and XVIII). The constructed dendrogram further showed that buffalo collagen family was generally most closely evolutionary relationship with other five representative mammals.
Gene Structure, Conserved Motif Analysis, and Promoter Analysis As shown in Figure 2A, the exon numbers of buffalo collagen family ranged from 2 to 118, and most of them harbored 40 to 60 exons. The analysis of intron numbers in the studied species showed that the intron number in the collagen family varied widely, ranging from 1 to 117 (Supplementary Figure S1).
Most collagen sequences contained more than forty introns, whereas only the COL8A1, COL8A2, and COL10A1 contained fewer than three introns. The collagen members in the same group had similar numbers of exon and intron among the studied species. These findings suggest that different groups in the collagen family have different patterns of intron number, which verify our previous classification process.
The MEME analysis showed that a total of 10 conserved motifs were identified in all collagens ( Figure   2B). After the Pfam search, the motifs 3 and 5 were annotated as the collagen domain ( Table 2), suggesting that at least collagen domain existed in the identified collagens. The results were also supported by the identified collages blasted against the conserved domain database (CDD) from NCBI ( Figure 2A). Interestingly, the Fibrillar collagen C-terminal domain (COLFI), Fibronectin type 3 domain (FN3), C-terminal tandem repeated domain in type 4 procollagen (C4), von Willebrand factor (vWF) type A domain (VWA), von Willebrand factor type C domain (VWC), and EMI domain also was determined in some collagen genes.
All identified buffalo collagens with the 1500 bp upstream sequences of coding sequence were selected and used to predict the TFBS. The results exhibited a total of 142 TFBS were predicted within the buffalo collagen genes ( Figure 3A). Among them, the largest TFBS detected in the identified collagen genes was the Myeloid Zinc Finger 1(MZF1), with the numbers of 61 (7.96%), followed by the  Figure S2). For the buffalo collagen family, moreover, we found that the MZF1 element was mostly detected in the Clusters I, III, and V, the MEIS1 element was mostly predicted in the Clusters II and VI, and ZNF384 element was mainly discovered in the Cluster IV (Supplementary Figure S3).

Chromosomal location and Gene Duplication Analysis of buffalo collagen family
The 128 buffalo collagen sequences were distributed on 17 chromosomes randomly ( Figure 4). Among them, chromosome 2, 4, 1, and 3 separately had 26, 18, 15, and 13 collagen each, chromosome 10 and X both contained 7 collagen each, chromosome 6 possessed 8 collagen each, chromosome 14 contained 6 collagen sequence, chromosome 7 and 15 both had 5 collagen each, while the remaining chromosome has less than 5 collagen each. Notably, the buffalo collagen genes were mainly located on the proximate or the distal ends of the chromosomes.
To investigate the evolution progress of putative gene family, we analyzed the duplication events of buffalo collagens. Here, a total of 103 (80.47%) collagen sequences were confirmed to be tandem duplicated genes. The majority of collagen tandem duplicated genes located on chromosome 1 ~ 4.
Four separate pairs of collagen tandem duplicated genes located on chromosome 10 and X. Four groups of three tandem duplicated gene pairs located on chromosomes 6, 7, 14, and 15. Four chromosomes (9, 13, 23, and 24) had the tandem duplicated gene pairs with the numbers of ≤ 2.
However, only one segmental duplication event (COL1A1 and COL1A2) was discovered in the buffalo collagen family. These results suggested that the tandem duplication played a predominant role in the expansion of buffalo collagen family.

7
To further explore whether the positive selection was involved in buffalo collagen genes, we analyzed the Ka/Ks ratios for each duplicated collagens pair ( Table 3). The results showed that the COL14A1_2/COL14A1_1 duplicated collagen pair was identified with the Ka/Ks ratios > 1, implying that this gene pair might involve in the positive selection. Moreover, a total of 23 duplicated collagen gene pairs with the Ka/Ks ratios < 1 were identified, 8 of which might experience strong purifying selection pressures because they had the Ka/Ks ratios were less than 0.5.

Expression profiling of buffalo collagen genes
To investigate the tissue expression distribution of the buffalo collagen family, we examined the expression patterns of 45 collagen genes using the public RNA-seq data. The results showed that a total of 19 collagen genes were not expressed in any tissues of buffalo, including the COL10A1, COL11A1, COL13A1, COL19A1, COL1A2, COL20A1, COL21A1, COL23A1, COL24A1, COL25A1, COL26A1, COL4A4, COL4A6, COL5A1, COL6A3, COL6A4, COL6A5, COL6A6, and COL8A2 ( Figure 5A). A total of 26 collagen genes was detected from at least one tissue and displayed differential expression in various buffalo tissues. Among them, 5 (COL9A3, COL6A1, COL6A2, COL3A1, and COL5A2) genes showed highest expression levels in the testis, while 3 (COL14A1, COL4A1, and COL4A2) genes had highest expression levels in the ovary follicle, indicating that these genes might be related to the reproductive traits in buffalo. Interestingly, the COL16A1 gene exhibited the highest expression levels in the mammary gland, suggesting that it might involve in the biological function of milk production trait in buffalo. Moreover, the result of qRT-PCR also showed a similar trend with the RNA-seq data ( Figure 5B).
Using the RNA-seq data, we further dissect the transcript levels of buffalo collagen family in mammary gland tissues at four lactation points. The results showed that 29 out of 45 collagen genes were mainly up-regulated at the D7, six collagens (COL6A6, COL3A1, COL1A1, COL27A1, COL8A1, and COL14A1) upregulated at the D50, two collagen genes (COL4A3 and COL10A1) were mainly expressed 8 at the D50 and D140, and only the COL24A1 was up-regulated at the D280 ( Figure 6). The results of qRT-PCR also demonstrated that four collagen genes in buffalo were up-regulated in the D50 and two collagens were highly expressed in the D7, implying that the RNA-seq data in the present study was available.

Discussions
Collagens, as extracellular matrix molecules, support cells for structural integrity and a variety of other functions [11], thus contribute to support mammary basic structure and development [28].
Currently, our understanding of the functional role of buffalo collagen family is limited. In the present study, we identified 128 collagen protein sequences in buffalo based on its complete genome sequence. The identified collagen protein sequences corresponded to the 45 collagen genes in buffalo, which was classified into six clusters based on their evolutionary relationships. The result was consistent with the previous classification of collagens described by S Ricardblum [9]. The conserved motif and gene structure analyses of buffalo collagen genes also supported this classification perspective. Conserved motif analysis indicated that all the identified collagens harbored at least collagen domain. Interestingly, the collagen cluster I genes contained a fibrillar collagen C-terminal domain (COLFI). For the gene structure analysis, most of the collagen genes contained 40 to 60 exons.
The intron number analysis revealed that the majority of collagen genes contained more than forty introns, whereas only the COL8A1, COL8A2, and COL10A1 contained fewer than three introns. These results coincide with the exon-intron structure of collagen genes from other representative mammals, suggesting that collagen family had a conserved gene structure.
For the identified buffalo collagens, we further predicted the TFBS in their promoter regions. The results exhibited a total of 142 TFBS were detected within the buffalo collagen family. We found that the largest TFBS with the numbers of 61 was the MZF1 (7.96%), followed by the MEIS1 (5.61%), ZNF354C (5.48%), RHOXF1 (5.22%), KLF5 (4.31%), ZNF384 (3.26%), and SPIB (2.74%). Moreover importantly, we found that the MZF1 element was mostly detected in Clusters I, III, and V of buffalo collagen family, the MEIS1 element was mostly predicted in the Clusters II and VI, and ZNF384 element was mainly discovered in the Cluster IV. Previous studies have demonstrated that as an Ncadherin transcription factor, MZF1 could participate in the epithelial-mesenchymal transition that plays a critical role in cell invasion and metastasis [29]. It is noted that cell migration through the extracellular matrix was a key component of morphogenesis [30], suggesting that these genes in the Clusters I, III, and V of buffalo collagen family might be involved in the mammary gland morphogenesis. MEIS1 has been demonstrated to be a transcriptional regulator of PAX6 that is required for the normal development of the endocrine cells [31,32]. This indicated that the MEIS1 along with PAX6 might be involved in the insulin secretion and synthesis pathway, thereby resulting in the changes of lactation function in mammary gland because the insulin plays a vital role for the mammary gland differentiation and subsequent lactation [33,34]. Our data also revealed the MEIS1 element was mostly detected in Cluster II and VI, implying that the biological function of them might be related to the insulin secretion and synthesis. Moreover, evidence indicated that ZNF384 as a transcription factor could bind and regulate the promoters of the extracellular matrix genes that sequences were confirmed to be tandem duplicated genes, but only one segmental duplication event to be discovered, which revealed that the tandem duplication had a predominant role in the expansion of buffalo collagen family. This finding was supported by Liu et al. [39] who highlighted that segmental duplications in most mammalian lineages are organized in a tandem configuration. For the duplicated collagen pairs, interestingly, we found that the COL14A1_2/COL14A1_1 duplicated collagen pair underwent the positive selection during the evolution of buffalo collagen family. Furthermore, twentythree pairs of duplicated collagens with the Ka/Ks ratios < 1 were identified, eight of them with the Ka/Ks ratios were less than 0.5 that might experience strong purifying selection pressures.
Tissue expression profiles from the RNA-seq data analysis were presented in this study, showing that a total of 26 collagen genes was detected from at least one tissue and displayed differential expression in various buffalo tissues. In particular, buffalo collagen genes in the same clusters had no consistent expression. For example, a total of five (COL9A3, COL6A1, COL6A2, COL3A1, and COL5A2) genes showed highest expression levels in the testis, while three (COL14A1, COL4A1, and COL4A2) genes had highest expression levels in the ovary follicle, indicating that these genes might be related to the reproductive traits in buffalo. The reason for the difference could be the limitation of available expression data for buffalo collagen. Importantly, some of the collagen genes exhibited tissue-specific expression, such as COL16A1 was mainly expressed in the mammary gland, while the COL2A1, COL8A1, and COL27A1 were explicitly expressed in the thymus, corpus luteum, and cerebellum, respectively. Apparently, the collagen family has been demonstrated to have various functions in multiple tissues because they contribute to maintaining the structural integrity and stability of tissues and organs [10]. Compared with the expression pattern represented by RNA-seq data, the expression profile generated by qRT-PCR was not wholly equal to that. We assumed that there existed multiple Previously studies showed that the members of the collagen family are the most abundant proteins in ECM that are tightly regulated throughout the development of the mammary gland [28,43,44]. Therefore, to explore the expression pattern of collagen genes in the buffalo mammary gland at different lactation points is necessary, which contribute to dissect the potential roles of these genes in the mammary gland morphogenesis. In the present study, our data revealed that most of the collagen genes are mainly expressed in the mammary gland at early lactation. It is noted that a total of 6 buffalo collagens (COL6A6, COL3A1, COL1A1, COL27A1, COL8A1, and COL14A1) was upregulated at the peak lactation, two collagen genes (COL4A3 and COL10A1) were mainly expressed at the peak and mid-lactation, and only the COL24A1 was up-regulated at the late lactation. These results were also supported by the qRT-PCR. Interestingly, Dai et al. [45] found that COL8A1 and COL1A2 both were upregulated in the bovine mammary gland during lactation compared to the dry period. Our findings suggested that these buffalo collagen genes displayed the specific biology function at different stages of mammary gland development. However, these findings remain to be confirmed.

Conclusion
In this work, we performed a genome-wide identification of buffalo collagen family, with 128 collagen protein sequences confirmed. Next, we performed the analyses of buffalo collagen genes, including the phylogeny, gene structures, chromosomal location, gene duplication, and expression patterns in different tissues. Most of the collagen genes were expressed in the different buffalo tissues and were up-regulated in the mammary gland at the early lactation, indicating that these genes play vital roles in the milk production of buffalo. The study provides valuable information on the collagen family in buffalo and will helps in determining the collagen gene functions.

Animal source
A total of eleven Murrah buffalo was used as the animal material in the present study, which was kept at the Buffalo Research Institute, Chinese Academy of Agricultural Sciences (BRI-CAAS). A total of three buffaloes were slaughtered and used for the fresh tissue collection, including brain, heart, liver, spleen, lung, kidney, stomach, large intestine, ovary, oviduct, pituitary, uterus, hypothalamus, lymph, and mammary gland. The biopsy samples of mammary gland tissue from eight buffaloes were 12 collected on days 7 (D7), 50 (D50), 140 (D140), and 280 (D280) after calving, which was further used for the real-time quantitative PCR (qRT-PCR). The biopsy procedure was performed based on the method reported by Schmitz et al. [46]. All fresh samples were immediately preserved in the liquid nitrogen until use.

Identification of the buffalo collagen genes
The whole-genome data of six representative mammals including the human (GRCh38.p12), cattle  [50]. Exon/intron structure analysis was performed by the buffalo genome annotation file using the in-house scripts. We analyzed the conserved protein motifs of collagen proteins in buffalo using the MEME programs (http://memesuite.org/tools/meme) at a maximum number of motifs, 10. Moreover, the transcription factor binding sites (TFBS) in the identified collagen promoter region with 1500 bp in length were predicted using the TFBSTools [51] R package. The main parameters of TFBS analysis were employed as the following: 13 JASPAR2018 was used to retrieve data from JASPAR database, the Homo sapiens was as the reference species, the minimum score for the hit was more than 0.98, and only the positive strand was considered.
Chromosome locations of each collagen gene were obtained from their genome resources. Collagen duplication events were identified using the Multiple Collinearity Scan Toolkit (MCScanX) previously described by Wang et al. [52]. Overall, a total of 58,532 buffalo protein sequences were analyzed using the BLAST search with E-value < 1E-5, and then the corresponding gene positions file were obtained from the buffalo genome annotation file with the GFF format, which can be set as the input files of MCScanX program.
For the duplicated collagen genes, we further performed the divergence estimates and diversity analysis. In brief, MEGA7 software with the MUSCLE algorithm was firstly utilized for the pairwise alignment of the homologous collagen gene pairs. Subsequently, the DnaSP v6.0 [53] software was employed to calculate the pairwise synonymous (Ks) and nonsynonymous (Ka) numbers of substitutions corrected for multiple hits.

Expression pattern Analyses
In order to determine the tissue expression profiles of collagen family, the RNA sequencing (RNA-seq) data (BioProject: PRJNA207334) from 29 different Italian Mediterranean buffalo tissues was used to calculate the fragments-per-kilobase-per-million fragments mapped (FPKM) values of buffalo collagens. Briefly, FASTQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) program was used to remove the adapter sequence from the raw sequence reads. The clean reads were aligned to the buffalo reference genome using Hisat2 [54] with default parameters. FPKM values were calculated using the StringTie [55].
To explore the expression profiles of the buffalo collagen family in mammary gland tissues at different lactation, including the early lactation (D7), peak lactation (D50), mid-lactation (D140), and late lactation (D280), the raw data (BioProject ID: PRJNA480718) from buffalo mammary gland tissues submitted by our group was used. Hierarchical clustering was performed using the Pheatmap [56] R package with the normalized scale method.
14 qRT-PCR Analysis Total RNA for the different buffalo tissues was isolated using the RNA Plus reagent (Tiangen, China).
For the 2 μg total RNA, first-strand cDNA was synthesized using a reverse transcriptase kit (Takara, Dalian, China). SYBR® Premix Ex Taq™ (Takara, Dalian, China) was used for the qRT-PCR analysis, which was monitored on the LightCycler 480 (Roche, CH), and each reaction was performed in triplicate. The expression levels of buffalo collagen family were analyzed by the 2-ΔΔCt method [57] and normalized using the GAPDH expression analysis. The collagens primer pairs used for the qRT-PCR analysis are shown in the Supplementary Table S1.

Consent for publication
Not applicable.

Availability of data and material
The data supporting our findings can be found in the additional supporting files.

Competing interests
The authors declare no competing financial interests  Table S1. Primers of qRT-PCR for collagen genes in buffalo Additional file 2: Figure S1. Information on intron numbers of the collagen family in studied species.
Additional file 3: Figure S2. Summary of the TFBS detected in the buffalo collagen genes.
Additional file 4: Figure S3. Summary of the TFBS detected in the buffalo collagen subfamilies.   Figure 1 Phylogenetic relationship of collagen proteins in 6 mammals. Different capitals represent different species: B means buffalo, C means cattle, G means goat, S means sheep, H means horse, and Hu mean human; Line with different colors indicates different subfamilies.   Tissue expression distribution of buffalo collagen using RNA-seq data and qRT-PCR. (A) tissue expression distribution of buffalo collagen genes based on the transcriptomic data.

Tables
(B) tissue expression pattern of buffalo collagen genes based on the qRT-PCR.
28 Figure 6 Cluster analysis of buffalo collagen in mammary gland tissue at different lactation points.

Supplementary Files
This is a list of supplementary files associated with this preprint. Click to download.