Adaptive Evolution after Gene Duplication in Maize

Background: Gene duplication can provide a genetic basis for evolution. The neo/sub-functionalization of redundant copies can promote plant growth and development, improve environmental adaptability, and even give rise to new species. Methods: We systematically identified the duplicated genes of maize and their origins by combining homologous clustering, chromosomal collinearity, Ks analysis and phylogenetic analysis. Differences among duplicated genes were analyzed by comparing gene structure, sequence composition, expression, and functional differentiation. Results: More than 70% of the genes in the maize genome were found to be duplicated genes, mainly derived from whole-genome duplication (WGD). WGD genes have a more complex gene structure, with more components and longer CDS. WGD genes contribute to the bimodal distribution of GC content, indicating that a subset of the duplicated genes is GC-rich. WGD genes are highly expressed and show tissue specificity. The functionally differentiated duplicated genes are mainly involved in the growth and stress response of maize. Conclusions: The results of this study indicate that duplicated genes produced by WGD provide new adaptability to maize growth, morphogenesis, and biotic and abiotic stress resistance, which are important to the evolution of maize. These effects were mainly achieved through increased genetic material, enriched gene components, and the divergence of gene expression and function. These results provide new insight into the evolution of duplicated genes and their role in evolution, and duplication events should be considered in future genome and gene family studies to account for species and gene evolution.


Background
Gene duplicates (GDs) mainly result from unequal exchange of chromosomal fragments, transposition and whole-genome duplication (WGD) [1][2][3][4]. These three mechanisms account for different proportions of GDs in different species, and WGD is more prevalent in the plant kingdom [5]. GDs provide raw material for evolution and can alter gene dosage and function through neo/sub-functionalization. GDs with novel functions can be retained through drift and selection. Like other genetic mutations, a gene duplication first arises in an individual, and it has only a small probability of being retained if the duplicate is selectively neutral [6]. Even if the duplicated gene persists in the population, it is still easily lost unless it is beneficial to the organism [7,8]. The probability of retention in the population depends on the duplicate's effect on fitness. A duplicated gene cannot be stably retained in the genome unless it is useful to the organism and its loss would decrease fitness [4][9][10][11][12][13][14][15].
GDs are present in almost all eukaryotes [16][17][18][19], and the copies often evolve divergent functions or expression patterns [20][21][22][23][24][25]. WGD generates a copy of every gene and therefore has greater potential than single-gene duplication to enhance organismal adaptation to environmental challenges and to give rise to novel traits. Analyses of expression data in multiple species have revealed differences in expression between duplicated genes [26][27][28][29]. Studies of microRNA regulation of paralogs in Arabidopsis, rice, and canola have proposed that differences in this regulation contribute to expression divergence between duplicated genes [30][31][32]. Duplicated genes have been involved in the evolution of functions important for speciation: in angiosperms, gene duplication led to the diversification of genes regulating seed and flower development, which had a great influence on the rise of the angiosperms [33]. In the past 350 million years, three whole-genome duplications in Arabidopsis directly led to an increase in transcription factors, signal transducers, and development-related genes [34]. The evolution of C4 photosynthesis in Sorghum bicolor (sorghum) and Zea mays (maize) involved the duplication of C3 genes and their subsequent functional divergence [35,36]. Through divergence of expression and function, the retained duplicated genes have played a key role in improving species fitness [37][38][39][40].
Maize shared an ancestor with other grasses 50-85 million years ago (MYA) [41,42], and it is presumed to have diverged from sorghum, its closer relative, about 12 MYA [43]. The maize genome is very complex and rich in repeated sequences, transposons, and paralogous genes. Previous studies proposed that maize has undergone at least three whole-genome duplication events: before the divergence of the monocots, before the divergence of the grasses, and after the formation of the maize lineage. These WGD events occurred about 110 MYA, 50 MYA and 12 MYA, respectively [44][45][46]. Previous research focused on the identification of WGD events, without a comprehensive analysis of the gene duplicates retained in maize and their impact on development and adaptive evolution. We used maize, sorghum, Oryza sativa (rice) and Arabidopsis genes for homology clustering, collinearity analysis, and phylogenetic analysis to identify duplicated genes in the maize genome. We then analyzed differences in sequence, expression, and function between duplicated genes, and further explored the effect of duplicated genes on the adaptive evolution of maize.

Identification of duplication genes in maize
All proteins of maize, sorghum, rice and Arabidopsis were collected as the subject database, and BLASTP was run for each protein to obtain similar sequences. Seven methods were used to cluster the similar sequences, and 1,119,261 collections were obtained in total (Fig. 1a). Because mcl and OrthoMCL are both based on Markov clustering, they produced similar numbers of collections: 23,052 and 22,114, respectively. Hcluster_sg, Hifix, MC-UPGMA, SiLix and SPICi also produced similar numbers: 13,691, 14,619, 13,445, 14,611 and 11,826 collections, respectively. These collections yielded a total of 17,137,340 gene pairs. Of these, 49% were identified by two or more clustering methods; among them, 13% of gene pairs were identified by two methods and 12% by three methods (Fig. S1).
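The support counting described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline. It assumes each clustering method's output has already been parsed into lists of gene identifiers. The function names, method labels and gene IDs are hypothetical, and the default cutoff of two methods mirrors the "two or more clustering methods" criterion.

```python
from collections import Counter
from itertools import combinations


def pair_support(clusterings):
    """Count, for each unordered gene pair, how many methods co-cluster it.

    `clusterings` maps a method name to a list of clusters, each cluster
    being a list of gene identifiers (hypothetical input layout).
    """
    support = Counter()
    for clusters in clusterings.values():
        pairs_seen = set()  # a method votes at most once per pair
        for members in clusters:
            for a, b in combinations(sorted(set(members)), 2):
                pairs_seen.add((a, b))
        support.update(pairs_seen)
    return support


def candidate_pairs(support, min_methods=2):
    """Keep pairs supported by at least `min_methods` clustering methods."""
    return {pair for pair, n in support.items() if n >= min_methods}


# Toy input with three made-up methods and five made-up genes.
toy = {
    "method_A": [["g1", "g2", "g3"], ["g4", "g5"]],
    "method_B": [["g1", "g2"], ["g3", "g4", "g5"]],
    "method_C": [["g1", "g2", "g3", "g4"]],
}
support = pair_support(toy)
candidates = candidate_pairs(support)
print(len(support), len(candidates))
```

Enumerating all pairs is quadratic in cluster size, so for clusters of the scale reported here a real implementation would likely need to stream pairs to disk or cap very large clusters.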
Gene pairs identified by at least two methods were considered candidate gene pairs for further analysis. The candidate dataset involved 33,157, 31,165, 26,918 and 24,240 genes in maize, sorghum, rice and Arabidopsis, accounting for 84%, 90%, 75% and 88% of the total genes of each species, respectively (Fig. 1b). Based on the candidate dataset, we reconstructed 13,388 new collections and counted the genes of each species in these collections (Fig. 1c). A total of 1,175 collections contained only maize genes, more than the corresponding number for any of the other three species. Most genes from the four species were included in the new collections, indicating that the four species are closely related. Genomic collinearity analysis was performed using MCScanX to determine which GDs originated from WGD. A total of 311,199 alignments from the previous BLAST search were used as input, with an E-value threshold of 1e-5, yielding 6,146 collinear blocks: 454 within maize, 403 between Arabidopsis and maize, 839 between rice and maize, and 817 between sorghum and maize (Table S1). The 6,146 collinear blocks contained a total of 75,606 genes, accounting for 55% of the 137,060 genes of the four species.
The distribution of Ks can be used to estimate when WGD events occurred. Ks values were calculated for the gene pairs in all collinear blocks (Fig. 2). There were two distinct Ks peaks within maize, consistent with the two WGDs reported in previous studies [44][45][46]. The first duplication occurred about 50 MYA, and the most recent large-scale duplication occurred about 12 MYA. We further infer that a relatively large-scale duplication took place about 110 MYA, before the separation of Arabidopsis from maize, sorghum, and rice. The genes produced by this ancient duplication were partially retained in each species.
Finally, we classified all maize genes according to duplication events. There were 28,911 duplicated genes, accounting for 73.2% of maize genes. Of these, 21,654 duplicated genes have both homology and collinear relationships; these were defined as syntenic duplication genes and account for 54.9% of all genes. Combined with phylogenetic analysis, the maize syntenic duplication genes were classified into four categories: Maize WGD, Sorghum-Maize, Pre-grasses WGD and Pre-monocots WGD. The number of genes in these categories was 9,606, 3,309, 5,917 and 2,922, accounting for 44.4%, 15.3%, 26.9% and 13.5%, respectively (Fig. 3, Table S2).

Characteristics of WGD genes in maize
WGD genes with longer CDS and more exons
Based on gene structure, we compared GDs of different origins in terms of the length of the gene, 5'UTR, 3'UTR, coding region, exons and introns, and the number of exons. The average gene length is 5,364 bp for single-copy genes and 4,495 bp for GDs. The mean CDS length of all genes is 1,096 bp, and the values for single-copy genes and GDs are 1,314 bp and 1,206 bp, respectively. Similar results were found for UTR, intron and exon lengths (Fig. S2). Thus, maize syntenic duplication genes and single-copy genes have more gene components or longer gene components. The Pre-monocots WGD genes had a higher transcript number, average gene length, average CDS length and average intron length (Table 1, Table S3).

The maize WGD gene contributes to the bimodal GC distribution
GC content can affect base-stacking interactions and thus plays an important role in stabilizing gene structure [61]. Earlier studies divided the distributions of GC3 (GC content at the third codon position) in different organisms into two main types, unimodal and bimodal [62,63]. In our analysis, GC3 was calculated for the coding regions of the different gene classes of maize. The GC3 distributions of the syntenic duplication genes and the tandem duplication genes were distinctly bimodal, indicating that a subset of the duplicated genes has high GC3. The genes with other homology relations all show unimodal distributions, and most of them have low GC3 content (Fig. 4a). Among the duplicated genes generated by different WGD events, all except the Pre-monocots WGD genes formed bimodal GC3 distributions (Fig. 4b). Pre-grasses WGD produced the largest proportion of high-GC3 genes. Some of the duplicated genes from the Maize-Sorghum and Maize WGD stages also have high GC3. Since these genes may be derived from the high-GC3 genes produced by Pre-grasses WGD, we speculate that Pre-grasses WGD is the main source of high-GC3 genes in the maize genome.
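For readers unfamiliar with the statistic, GC3 can be computed from a coding sequence as sketched below. This is an illustrative implementation under simple assumptions: the CDS starts in frame, any trailing incomplete codon is ignored, and the stop codon is counted like any other codon. The function names and the example sequence are hypothetical and are not taken from the study.

```python
def gc_fraction(bases):
    """Fraction of G or C among the given bases (case-insensitive)."""
    bases = bases.upper()
    if not bases:
        raise ValueError("empty sequence")
    return sum(base in "GC" for base in bases) / len(bases)


def gc3_fraction(cds):
    """GC fraction at third codon positions of an in-frame CDS.

    A trailing incomplete codon, if any, is ignored.
    """
    usable = len(cds) - len(cds) % 3
    third_positions = cds[2:usable:3]  # 0-based indices 2, 5, 8, ...
    return gc_fraction(third_positions)


# Made-up 5-codon CDS: ATG GCC GCG TTC TAA
toy_cds = "ATGGCCGCGTTCTAA"
print(gc_fraction(toy_cds), gc3_fraction(toy_cds))
```

Applying `gc3_fraction` to every gene in a class and plotting a histogram of the values is one way to check whether that class's distribution is unimodal or bimodal.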

High average expression and tissue-specific expression in WGD genes
In addition to sequence differences, changes in transcription levels are an important part of genetic divergence. Transcription data from 60 developmental stages and tissues of the maize reference line B73 were used to examine differences in transcription levels between different types of duplicated genes [64]. Genes of different homology relations and of different WGD events were compared; in total, 39,457 genes have transcripts. The average transcription value (log2(FPKM)) across six tissues was 4.1 for WGD genes and 4.6 for single-copy genes, compared with 1.22, 2.04, 1.57 and 1.66 for genes without homologs, maize-specific genes, tandem duplication genes, and other duplication types, respectively (Fig. 5a/b). These results indicate that single-copy genes and WGD genes have higher transcription levels in maize. A further comparison of genes from different WGD events found no significant difference in average expression level between them.
Differentially expressed genes were screened by calculating the difference between the expression level in each tissue and the mean across all tissues. A total of 16,207 differentially expressed genes were obtained (Fig. 5c). Most of the tissue-specific genes were syntenic duplication genes (9,902), accounting for 61% of tissue-specific genes. The numbers of syntenic duplication genes specifically expressed in anther, root, stem, leaf, seed, embryo and endosperm were 3,254, 2,911, 2,080, 1,608, 2,768, 3,306 and 4,032, respectively. Among them, genes specifically expressed in the endosperm accounted for 41% of the tissue-specific syntenic duplication genes. Further, 3,709 Maize WGD genes were tissue-specific, accounting for 39% of the total number of Maize WGD genes. The corresponding numbers for Maize-Sorghum, Pre-grasses WGD, and Pre-Monocots WGD genes were 1,626, 3,307, and 1,206, accounting for 38%, 28%, and 9%, respectively (Fig. 5d). The results show that genes retained after Maize WGD and Pre-grasses WGD have more pronounced tissue specificity in expression.
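The screening idea described above (compare each tissue with the mean across all tissues) can be illustrated with the sketch below. The paper does not report its cutoff or any pseudocount, so the two-fold threshold (`min_log2_fc=1.0`), the pseudocount of 1 FPKM, the function name and the example values are all assumptions made for illustration only.

```python
import math


def tissue_specific_tissues(expr, min_log2_fc=1.0, pseudocount=1.0):
    """Return tissues where a gene exceeds its all-tissue mean by the cutoff.

    `expr` maps tissue name -> FPKM for one gene. The log2 fold change is
    taken against the mean over all tissues; the pseudocount avoids
    division by zero for unexpressed genes.
    """
    mean_fpkm = sum(expr.values()) / len(expr)
    hits = []
    for tissue, fpkm in expr.items():
        log2_fc = math.log2((fpkm + pseudocount) / (mean_fpkm + pseudocount))
        if log2_fc >= min_log2_fc:
            hits.append(tissue)
    return hits


# Made-up expression profiles for two hypothetical genes.
endosperm_gene = {"root": 2.0, "stem": 1.0, "leaf": 3.0, "endosperm": 50.0}
flat_gene = {"root": 5.0, "stem": 5.0, "leaf": 5.0, "endosperm": 5.0}
print(tissue_specific_tissues(endosperm_gene))
print(tissue_specific_tissues(flat_gene))
```

With this definition a gene can be specific to more than one tissue, which is consistent with the per-tissue counts above summing to more than the number of tissue-specific genes.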

Maize WGD genes are mainly involved in developmental and stress responses
The functional analysis included gene ontology annotation and metabolic pathway analysis. The ontology annotations of the recent Maize WGD genes were enriched in plasma membrane, transcription factor activity, and response to abiotic stimulus (Fig. S3, Table S4). The Pre-grasses WGD genes were enriched in functional categories similar to those of the recent Maize WGD genes; in the biological process category these included response to biotic or abiotic stress, cell death, flower development, and pollen-pistil interaction (Table S5).
KEGG pathway enrichment analysis of the different types of duplicated genes in maize showed that the most recent Maize WGD genes are enriched in MAPK signaling, plant hormone signaling, autophagy, and circadian rhythm pathways (Table 2). The Pre-grasses WGD genes are enriched in amino acid metabolism-related pathways, sugar and fatty acid metabolism-related pathways, starch and sucrose metabolism, and the plant-pathogen interaction pathway, which play important roles in plant development and response to biotic and abiotic stresses [65] (Table S6).
The MAPK signaling pathway, which plays an important role in the perception of stimuli and in adaptive responses, was enriched in the KEGG analysis. In plants, MAPKs form a complex family involved in signal transduction in response to pathogens, drought, salinity, cold, wounding, ozone, ROS and hormones [66][67][68][69]. Previous studies found 19 MAPK genes in maize [31,70]; we identified 22 MAPK genes in the same collection, 14 of which arose from the most recent Maize WGD, while the remaining 8 were retained after Pre-grasses WGD (Table 3, Fig. 6a). Among them, the most studied gene is ZmMPK6-2, which has been reported to function in low-temperature and ABA signaling [71,72]. ZmMPK6-1, which is very closely related to ZmMPK6-2, was formed by the most recent Maize WGD. The expression of the MAPK genes was further analyzed (Fig. 6b). Genes retained from Pre-grasses WGD were mainly differentially expressed in embryo, endosperm, and pollen, while genes retained from Maize WGD were widely expressed in roots, stems and leaves. That genes retained from different events differ in their tissue expression suggests that gene duplication has had an important impact on the expansion and functional diversification of the MAPK gene family.

Discussion
Comprehensive homology, collinearity, Ks distribution, and phylogenetic analysis contribute to accurate identification of WGD genes
More than 80% of the sequences in the maize genome are transposons, and some transposons are extremely active [73]. To identify WGD accurately, analysis beyond homologous gene clustering is therefore required. WGD typically leaves characteristic collinear fragments at the chromosome level and leaves more paralogous genes than other modes of duplication. Based on these features, Ks can be used to estimate when a duplication occurred, and Ks-based approaches have been widely used [7,74,75]. In our results, Ks analysis within the collections identified with reasonable accuracy the maize WGDs that occurred about 110, 50 and 12 million years ago, in agreement with previous studies [44][45][46]. Methods based on collinearity require collinear relationships at the whole-genome level and are also widely used [46,76,77]. With this approach, we identified 6,146 collinear blocks containing a total of 75,606 genes, accounting for 55% of the total genes of the four species. However, the collinearity-based approach is susceptible to high levels of genomic rearrangement and to loss of duplicated genes, both of which are widespread and rapid [78][79][80]. Recent studies have also found that saturation effects can affect Ks [81], making Ks-based estimates of duplication times inaccurate. Jiao et al. attempted to construct a more accurate map of WGD events by means of phylogenetic trees [33]. This method is more generally applicable than a purely collinearity-based or Ks-based method and has the advantage of simultaneously estimating the timing and phylogenetic placement of the WGD. However, it is not clear how the selection of particular nodes of particular gene families affects the outcome, as all selected duplication nodes must be located within a given time period.
It is also unclear what background distribution of duplication events should be used for this time period, or how errors in branch length estimation affect the background distribution [82]. In our analyses, we used a combination of collinearity, Ks distribution, and phylogenetic analysis and finally obtained 15,158 maize gene pairs with a syntenic duplication relationship. For PGDD, a database of gene and genome duplication in plants [44], this number was 13,408. A total of 9,984 gene pairs were detected both in our results and in PGDD, accounting for 74.5% of the PGDD gene pair set, which indicates that the results obtained are credible. The Ks distribution of all homologous gene pairs was also analyzed, and it was similar to the Ks distribution of the collinear genes, further supporting the reliability of the results obtained in this study. In addition, since the predicted maize-sorghum divergence time is close to the time of the most recent WGD of maize, we speculate that the large-scale genome duplication of maize or its ancestor led to the divergence of maize and sorghum.
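The comparison with PGDD amounts to intersecting two sets of unordered gene pairs and expressing the overlap as a share of the reference set. The sketch below shows one way to do this and re-checks the reported percentage from the counts given in the text. The function name and the toy gene identifiers are hypothetical; only the counts 9,984 and 13,408 come from the paper.

```python
def pair_overlap(ours, reference):
    """Return (shared pair count, shared count / reference size).

    Pairs are treated as unordered, so (a, b) and (b, a) are the same pair.
    """
    ours_set = {frozenset(pair) for pair in ours}
    reference_set = {frozenset(pair) for pair in reference}
    shared = ours_set & reference_set
    return len(shared), len(shared) / len(reference_set)


# Toy example with made-up gene identifiers.
toy_ours = [("a", "b"), ("c", "d"), ("e", "f")]
toy_reference = [("b", "a"), ("c", "d"), ("x", "y"), ("g", "h")]
print(pair_overlap(toy_ours, toy_reference))

# Arithmetic check on the counts reported in the text.
reported_share = round(100 * 9984 / 13408, 1)
print(reported_share)
```

Normalising each pair to a `frozenset` matters in practice, because two resources rarely list the members of a pair in the same order.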

WGD genes with more complex gene structure
Our analysis of gene composition shows that WGD genes in maize have a more complex structure, with longer genes, CDS and UTRs, as well as a greater number of exons and transcripts. Studies of polyploid plant evolution have also suggested that WGDs provide genomic novelty and complexity that help organisms meet environmental challenges and improve their adaptability [83,84]. We infer either that complex genes were preferentially retained after WGD, or that genes enriched their composition in adapting to a challenging environment and were then retained. A longer CDS corresponds to a longer protein, providing more protein secondary and tertiary structure variants, and a greater number of exons provides the basis for transcript variants. For single-copy genes, the duplicated copies were not retained after WGD; a possible reason is that these genes are important and functionally conserved, being related to the basal metabolism and regulation of the organism, so that dosage alteration would be lethal. If a lineage has experienced WGD during its evolution, a single-copy gene implies that one of its duplicates has been lost in the course of evolution. The retention of genes with a significantly richer genetic composition also suggests that they are evolutionarily favored. This further supports the view that genes retained after WGD gradually acquire a richer gene composition.

Duplication genes with high GC contribute to bimodal GC distribution of maize
A large number of studies have analyzed the GC content of the genomes of different species [63][85][86][87][88]. Previous studies have shown that there are two main types of gene GC distribution in different organisms, namely unimodal and bimodal. A bimodal distribution indicates that some genes in the genome have higher GC content and the others have lower GC content. The GC content of grasses, Musa, Zingiberaceae, and warm-blooded animals shows a bimodal distribution, whereas that of cold-blooded animals and other plants (including dicots) shows a unimodal distribution [63,85]. This study found that the bimodal GC distribution of maize derives mainly from tandem duplication and syntenic duplication genes. Among the WGD genes, those from the Pre-grasses WGD contribute most to the GC bimodality. We speculate that, owing to the Pre-grasses WGD, the GC content of the genes of the grass ancestor became bimodal, resulting in bimodal GC content in most grasses.
Previous studies have suggested that differences between high and low GC content in genes are affected by selection pressure [63,85,89], and considerable evidence from animals supports an important role for GC-biased gene conversion (gBGC) in shaping the GC content of genes [90][91][92]. Glémin et al. proposed a simple model in which, under the gBGC hypothesis, a strong 5'-3' gradient of recombination and gBGC can produce a bimodal distribution of GC content, whereas a weaker gradient may result in a unimodal distribution [93]. In our results, Pre-grasses WGD occurred at about 110 MYA, in the late Cretaceous [94]. We speculate that the higher surface temperature at that time may have been a cause of Pre-grasses WGD, because at elevated temperatures increased GC content can enhance base-stacking forces and thereby increase base stability [61]. Based on the gBGC model, we further speculate that the WGD changed the effect of gBGC before the divergence of the grasses, and that the evolution of genes after duplication separated out high-GC genes, which spread through populations and species and eventually formed the bimodal GC distribution. Therefore, Pre-grasses WGD made an important contribution to the bimodal distribution of GC content in the ancestor of the grasses and in the various grasses after their divergence.

WGD genes contribute to enhanced environmental adaptability in maize
Experiments that artificially amplify or delete duplicated genes demonstrate their impact on fitness. In a real-time evolution system in Salmonella enterica, a gene was amplified to high copy number, and the duplicated copies accumulated mutations that provided enzymatic specialization of different copies and faster growth [11]. In addition, compared with the deletion of one of the duplicated genes in Schizosaccharomyces pombe, the deletion of the duplicated gene in Saccharomyces cerevisiae significantly reduced fitness [40]. Duplicated genes can also enhance adaptability by altering gene expression: a previous study in angiosperms suggested that duplicated MADS-box genes are implicated in the regulation of a variety of flower developmental processes, ultimately affecting the evolution of angiosperms [95]. Our work found that the syntenic duplication genes of maize had a higher average expression level, indicating that genes from large-scale genome duplication are more actively expressed. Further analysis revealed that maize syntenic duplication genes showed not only increased expression but also strong tissue specificity. A similar result was found in polyploid Gossypium hirsutum, where the expression patterns of duplicated genes are specific to organs and developmental stages and have been shown to respond to stress [96,97].
Functional divergence is the most direct evidence of the adaptive impact of duplicated genes. Gene ontology enrichment analysis showed that the Maize WGD genes are mainly involved in development and response to stress, and pathway analysis showed enrichment in MAPK signaling pathways. These functions and pathways have an important impact on the adaptability of species [66][67][68][69]. Based on these results, we propose the following model of the effect of gene duplication on maize adaptive evolution. First, some of the genes may be pseudogenized after duplication because of factors such as functional redundancy or disruption of dosage balance. Then, some of the duplicated genes diverge in expression, especially through tissue-specific expression, and in function, adapting to a changing environment. Finally, under natural or artificial selection, copies that enhance individual fitness are retained.

Conclusions
The evolution of species is a complex process in which gene duplication plays an important role. We have comprehensively identified duplicated genes in maize, particularly those derived from three whole-genome duplications. Our results indicate that duplicated genes produced by WGD provide new adaptability to maize growth, morphogenesis, and biotic and abiotic stress resistance, which are important to the evolution of maize. These effects were mainly achieved through increased genetic material, enriched gene components, and the divergence of gene expression and function. These results provide new insight into the evolution of duplicated genes and their role in evolution, and duplication events should be considered in future genome and gene family studies to account for species and gene evolution.

Methods
Maize, rice, sorghum, and Arabidopsis gene and genome data were obtained from the Ensembl Plants database. The maize reference genome is the AGPv3 version, with a total assembled length of 3,233,616,351 bp and 39,469 annotated coding genes (AGPv3 5b). To explore the evolutionary pathways of GDs in maize, sorghum, rice and Arabidopsis were selected as references (Fig. S4). The Arabidopsis reference genome was published in 2010 (TAIR10), with a total length of 135,670,229 bp and 27,416 annotated genes. The rice reference genome was IRGSP-1.0, with a length of 374,424,240 bp and 35,679 genes. The sorghum reference genome version was Sorbi1, with a length of 738,540,932 bp and 34,496 coding genes (Table S7). The longest complete coding sequence (CDS) of each gene and its encoded protein sequence were used for analysis. Only CDS with correct start and stop codons were included in the dataset.
Gene similarity was assessed by pairwise BLAST alignment of protein sequences. All protein sequences of maize, sorghum, rice and Arabidopsis were collected as the target dataset, and BLASTP was performed using each protein as a query, with an E-value of 1e-5 and a maximum of 100 aligned sequences retained per query. Cluster analysis was performed on the genes of maize, sorghum, rice, and Arabidopsis using a variety of clustering methods (Table S8): SiLix [47], which uses graph algorithms to construct "sequence families" based on disjoint sets; Hifix [48], an efficient clustering method built on SiLix; Mcl [49] and OrthoMCL [50], which are based on the Markov clustering algorithm; MC-UPGMA [51], which uses the unweighted pair group method with arithmetic averages; SPICi [52], a fast and efficient clustering method based on a heuristic algorithm; and Hcluster_sg [53], which is based on constructing a sparse graph.
Genomic collinear blocks were identified using MCScanX [54], and downstream analysis was based on the collinearity results. The coding sequences of all aligned gene pairs were obtained, Ks and 4DTv were calculated, and possible concatenated collinear blocks were then filtered. Non-syntenic duplicated genes could be further classified using the collinearity results. Synonymous substitutions do not change gene function and are often regarded as random events, so the divergence between two coding sequences can be estimated from the synonymous substitution rate between them. According to the calculation of Gaut et al., the average substitution rate per site per year in gene sequences is 6.5 × 10^-9 [55]. Homology analysis was used to separate maize-specific genes from genes with homologs, and the evolutionary relationships between genes within the previously defined collections were then analyzed. First, the coding sequence and protein sequence of each gene were obtained, and muscle was used to perform sequence alignment. The Ks and 4DTv values between each pair of genes were calculated at the same time; treebest was used to optimize the sequence alignment results, build the gene tree and extract homologous relationships between genes. Gene cluster analysis distinguished homologous genes into duplicated genes and single-copy genes. The WGD genes were divided into those from the most recent WGD, from the WGD before the divergence of the grasses and from the WGD before the divergence of the monocots. The duplicated genes generated by the three WGD events were used as the basis for subsequent analysis.
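The text gives the substitution rate but not the formula used to turn Ks into a date. A common convention, assumed here and not confirmed by the source, is T = Ks / (2r): the two copies diverge independently, so substitutions accumulate along both lineages. The sketch below applies that convention with the rate from Gaut et al. [55]; the function names are illustrative.

```python
# Rate from the text [55], in synonymous substitutions per site per year.
SUBSTITUTION_RATE = 6.5e-9


def ks_to_mya(ks, rate=SUBSTITUTION_RATE):
    """Convert a Ks value to divergence time in million years (T = Ks / 2r)."""
    return ks / (2.0 * rate) / 1e6


def mya_to_ks(mya, rate=SUBSTITUTION_RATE):
    """Inverse conversion: expected Ks for a duplication `mya` million years old."""
    return 2.0 * rate * mya * 1e6


# Ks values that this convention would place at the three reported WGD ages.
for age in (12, 50, 110):
    print(age, round(mya_to_ks(age), 3))
```

Under this assumption, duplications dated to about 12, 50 and 110 MYA correspond to Ks values of roughly 0.16, 0.65 and 1.43. The largest of these is in the range where saturation of synonymous sites is usually a concern, which fits the caution about Ks-based dating raised in the Discussion.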
To further explore the evolution of duplicated genes, subsequent analyses were carried out from two aspects. (1) Comparison between genes of different homology relations: a) genes with no homologs (No Paralog); b) orthologous single-copy genes with no paralogs (Singleton); c) collections containing only maize genes (All maize); d) tandem duplication genes (Tandem); e) syntenic duplication genes (Syntenic Duplication); f) other types of duplicated genes (Other Duplication). (2) Comparison between syntenic duplication genes originating from different duplication events: a) the most recent WGD of maize (Maize WGD); b) maize and sorghum divergence (Maize-Sorghum); c) the WGD before the divergence of the grasses (Pre-grasses WGD); d) the WGD before monocot divergence (Pre-Monocots WGD).
For the analysis of gene component differences among duplicated genes, the transcript number, exon number, gene length, and 5'UTR, 3'UTR, CDS, exon, and intron lengths of each gene were counted separately. For codon analysis, the CDS of each gene was used to calculate its GC and GC3 content, and the distribution of GC and GC3 content was calculated with a sliding window. TATA box and CAAT box statistics were computed for the promoter region 2 kb upstream of each gene. Expression analysis used the maize expression data published by Sekhon et al. in 2013 [56]. Genes of different origins and genes from different WGD events were compared in terms of mean expression and expression levels in different tissues: roots, stems, leaves, seeds, embryos, pollen, filaments, and endosperm. Differentially expressed genes were identified by calculating fold changes. Functional annotation was performed using InterProScan [57] and Gene Ontology (GO) [58], and GO enrichment analysis was performed using BiNGO [59]. Metabolic pathway analysis was performed using the KEGG database [60].

Consent for publication
Not applicable.

Availability of data and materials
All relevant data are contained within the paper and its Additional files.

Competing interests
The authors declare that they have no competing interests.

Funding
This work was supported by the National Key Basic Research Program of China (No: 2014CB138202). The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Authors' contributions
BW and HL designed the research; BW, HL, QX and YW performed research; BW, QX, YW, JZ, YH1, YL1, GY and YL2 analyzed data; BW and HL wrote the paper.
