Lineage-specific protein repeat expansions and contractions reveal malleable regions of immune genes

Functional diversification, a higher evolutionary rate, and intense positive selection help a limited number of immune genes interact with many pathogens. Repeats in protein-coding regions are a well-known source of functional diversification, adaptive variation, and evolutionary novelty in a short time. Repeats play a crucial role in biochemical functions like functional diversification of transcription regulation, protein kinases, cell adhesion, signaling pathways, morphogenesis, DNA repair, recombination, and RNA processing. Repeat length variation can change the associated protein’s interaction, efficacy, and overall protein network. Repeats have an intrinsic unstable nature and can potentially evolve rapidly and expedite the acquisition of complex phenotypic traits and functions. Because of their ability to generate rapid, adaptive variations over short evolutionary distances, repeats are considered “tuning knobs.” Repeat length variation in specific genes, like RUNX2 and ALX4, is associated with morphological and physiological changes across vertebrates. Here we study repeat length variation as a potent source of species-specific immune diversification across several clades of tetrapods. Moreover, we provide a clade-wise comprehensive list of immune genes with repeat types for future studies of morphological/evolutionary changes within species groups. We observe significant repeat length variation of FASLG and C1QC in Rodentia and Primates’ contrasting species groups, respectively.


INTRODUCTION
Immune genes are well-known hotspots of intense positive selection in several clades of vertebrates [1][2][3]. The abundance of selection signatures in immune genes results from an ongoing arms race between the host and the pathogen [4,5]. Adaptive evolution occurs in most interferon-stimulated genes [6]. In addition to the adaptive evolution of genes, changes in the gene content have helped deal with the changing pathogen repertoire [7,8]. Host genomes have also responded to pathogens through other strategies such as splice isoforms [9] and several forms of non-coding RNA [10,11] that counter pathogens. Hence, understanding the source of genomic novelty which can take on immune roles is an important question. Repeats in protein-coding genes are a source of evolutionary novelty but have received less attention [12].
Proteins of diverse life forms contain recurring motifs like tandem repeats, repetitive domains, and periodic structures [13][14][15][16]. In the human genome, around 20-28% of proteins have at least one stretch of homo-peptide residues [17,18]. We refer to such amino acid repeat sequences of low complexity as repeats. Repeats play a crucial role in biochemical functions like functional diversification of transcription regulation [19][20][21], protein kinases and cell signaling pathways [22], cell adhesion [23], morphogenesis [24], DNA repair, recombination, development [17], and RNA processing [25]. Repeats have an intrinsic unstable nature and can potentially evolve rapidly and expedite the acquisition of complex phenotypic traits and functions [12,14,26]. Because of their ability to generate rapid, adaptive variations over short evolutionary distances [27][28][29], repeats are considered "tuning knobs" [19,20]. A combination of relaxed purifying and positive selection drives fast divergence of the repeat sequences, presumably yielding new functions [12]. These novel functions may be of adaptive significance in different ecological niches [30].
Repeats are highly unstable and prone to contraction or expansion by recombinational mechanisms (like gene conversion or unequal crossing-over) or replication slippage [22]. This volatile nature of repeats contributes to functional and morphological diversity, such as the repeat length variation in the case of the RUNX2 gene [31,32]. RUNX2 is a DNA-binding transcription factor that plays a crucial role in skeletal development in vertebrates [33]. RUNX2 contains two adjacent repeats, poly-Q and poly-A [34]. Studies have found a positive correlation of the Q/A ratio with facial shape and with the angle of facial bones compared to skull length across different breeds of dogs [32]. The QA repeat length variation in RUNX2 contributes to the evolution and diversity of skeletal structure across vertebrates [31].
The importance of repeats in parasites for cell adhesion, invasion, and immune evasion of host cells is well documented [35][36][37][38]. Several repeat-containing proteins are involved in host-parasite interactions, such as host cell entry, survival, adaptation to the intracellular environment, and escape from the host immune system [35][36][37][38]. Furthermore, specific repeats like leucine-rich repeats (LRRs) are known to be exploited by both hosts and pathogens in host-pathogen interactions. The pathogens use LRRs to attach to the host cell and enter it. In contrast, the host cells utilize LRRs to recognize pathogen-specific molecular patterns [36,39], where they sense specific pathogen-associated molecules and activate the innate immune system [40,41]. Due to their variable nature, high abundance in immune-related genes, and consequences on function and morphology, repeats in immune genes can potentially contribute to the clade-specific diverse immune repertoire.
In contrast to previous studies, which focus on the functional implication of specific repeats, we characterize and quantify lineage-specific changes in several Tetrapoda clades by investigating the role of repeats in all immune genes. Repeats occur in a third of the immune genes and have experienced lineage-specific changes in length and composition. We reasoned that protein repeats occur in malleable regions of immune genes and are a potent source of evolutionary novelty. Our analysis focused on identifying the global patterns of repeat occurrence and expansion/contraction. In our multi-clade dataset, most repeats occur in the initial 20% or the final 10% of the protein and in genes within a specific range of GC content. The proportion of genes with repeat expansion or contraction is species-specific. Our in-silico study identifies candidate genes with species-specific changes in repeat length, which may have functional implications.

MATERIALS AND METHODS

Dataset preparation
To use a well-curated and stable set of genes with immune functions, we rely upon gene ontology (GO) annotation. We compiled a list of 2945 immune genes by downloading human genes annotated with the term "immune system process" (Accession # GO:0002376) from Ensembl release 106 [42] (see Supplementary Table S1). We obtained the full-length coding sequences and corresponding protein sequences of human immune genes and their one-to-one vertebrate orthologs for 216 species spanning 12 clades from NCBI [43] (see Supplementary Table S2). We filtered out readthrough annotations and those genes which lacked ortholog annotation across at least four species for each clade, resulting in a final set of 2488 genes. In the case of multiple splice isoforms, we chose the sequence with the most similar length and sequence identity to the other orthologs. We downloaded the species tree for non-avian clades from the TimeTree website [44] and the species tree for birds from the BirdTree website [45].

Identification of protein repeats
We identified repeats in the protein sequences using fLPS2.0 [46] with default settings. For each gene, we provided the clade-specific composition matrix as a background. We filtered out repeats identified by fLPS2.0 if (a) they were shorter than a four amino acid stretch, or (b) the repeat purity of the stretch was less than 70%, or (c) they consisted of more than four different amino acids, or (d) the repeat was not present in at least four species. These criteria ensured that we only focused on a conservative, reliable set of repeats. We removed repeats containing "X," corresponding to gaps in the sequence or annotation errors in the protein. The complete list of repeat locations is available in Supplementary Table S3.
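The filtering criteria above can be illustrated with a minimal sketch. It is not the pipeline used in the study: the `Repeat` record, the definition of purity as the fraction of residues belonging to the reported bias signature, the reading of criterion (c) as the number of residue types in that signature, and the grouping of repeats by gene and signature as a stand-in for orthology are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Repeat:
    """Hypothetical record for one repeat call (field names are assumptions)."""
    gene: str
    species: str
    segment: str    # amino acid sequence of the biased region
    signature: str  # residues reported as over-represented, e.g. "GHP"


def purity(rep: Repeat) -> float:
    """Assumed definition: fraction of segment residues that are in the signature."""
    return sum(aa in rep.signature for aa in rep.segment) / len(rep.segment)


def passes_sequence_filters(rep, min_len=4, min_purity=0.70, max_residue_types=4):
    """Apply the per-repeat criteria (a) to (c) plus the 'X' exclusion."""
    if "X" in rep.segment:                            # gaps or annotation errors
        return False
    if len(rep.segment) < min_len:                    # (a) too short
        return False
    if purity(rep) < min_purity:                      # (b) low purity
        return False
    if len(set(rep.signature)) > max_residue_types:   # (c) too many residue types
        return False
    return True


def filter_repeats(repeats, min_species=4):
    """Keep repeats passing (a) to (c) that are seen in at least `min_species` species (d)."""
    kept = [r for r in repeats if passes_sequence_filters(r)]
    species_seen = {}
    for r in kept:
        species_seen.setdefault((r.gene, r.signature), set()).add(r.species)
    return [r for r in kept if len(species_seen[(r.gene, r.signature)]) >= min_species]
```

In this sketch the species count is taken after the per-repeat filters, so a repeat that fails purity in one species does not count towards criterion (d); the source does not state the order in which the filters were applied.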

Comparison of orthologous sequences
We first excluded coding sequences with premature stop codons and subsequently removed genes with sequences in fewer than three species within each clade. We performed multiple sequence alignment of the coding sequences (codons) using the MUSCLE [47] aligner with 100 iterations. We mapped the location of protein repeats onto the aligned nucleotide coding sequences. Subsequently, we identified orthologous repeats based on positional overlap in the alignment. We performed a phylogenetic correction of the length for each orthologous repeat using the Phylogenetically Independent Contrasts (PIC) method implemented in the phytools [48] R [49] package. The most extreme values of the phylogenetically corrected repeat lengths for the internodes were considered significantly contrasting between the corresponding species groups. Between the contrasting species groups, the group with the higher mean repeat length was considered expanded and the other group contracted (see Supplementary Table S4). We filtered out those cases in which either of the contrasting species groups had fewer than three species to ensure reliability of the results (see Table 1).
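To make the phylogenetic correction concrete, the following is a minimal sketch of Felsenstein's independent contrasts on a small, fully bifurcating tree. It is a generic illustration and not the R implementation used in the study; the nested-tuple tree encoding, the function name `pic`, and the toy repeat lengths are assumptions.

```python
import math


def pic(node, trait, contrasts):
    """Compute standardized contrasts by post-order traversal.

    A leaf is (name, branch_length); an internal node is
    (left_child, right_child, branch_length). Returns the estimated
    trait value at `node` and its adjusted branch length, and appends
    one standardized contrast per internal node to `contrasts`.
    """
    if len(node) == 2:                      # leaf
        name, length = node
        return trait[name], length
    left, right, length = node
    x1, v1 = pic(left, trait, contrasts)
    x2, v2 = pic(right, trait, contrasts)
    # Contrast scaled by the square root of the summed branch lengths
    contrasts.append((x1 - x2) / math.sqrt(v1 + v2))
    # Ancestral estimate: branch-length-weighted average of the children
    value = (x1 / v1 + x2 / v2) / (1 / v1 + 1 / v2)
    # Lengthen the parent branch to account for estimation uncertainty
    return value, length + v1 * v2 / (v1 + v2)


# Toy example: repeat lengths (amino acids) for four hypothetical species
tree = ((("A", 1.0), ("B", 1.0), 1.0), (("C", 1.0), ("D", 1.0), 1.0), 0.0)
repeat_length = {"A": 10, "B": 12, "C": 20, "D": 24}
contrasts = []
root_value, _ = pic(tree, repeat_length, contrasts)
```

Here the contrast at the root, between the (A, B) and (C, D) groups, has the largest absolute value, which is the kind of internode that would be flagged as contrasting. The cutoff used to call a contrast "extreme" is not specified in the text and is therefore not encoded here.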

Molecular evolutionary analysis
We identified lineage-specific positive selection using the aBSREL [50] program from the HyPhy [51] package. We also identified positively selected sites for each gene using codeml from the PAML [52] package by comparing models M7 and M8.

Characterization, visualization, and analyses
We performed gene enrichment analysis using the ShinyGO 0.76 server [53] with all immune genes as the background gene set (see Supplementary Tables S1 and S5). The secondary structures were drawn using Protter (version 1.0), and the transmembrane domains were predicted in PROTEUS2 (version 2.0) [54,55]. The protein structures of genes of interest were downloaded from the AlphaFold Protein Structure Database [56]. If the protein structure of a specific gene for a species was unavailable in AlphaFold, we predicted the structure on the SWISS-MODEL [57] server. All the protein structures were visualized in ChimeraX [58] v1.16. To test whether the overlap of repeat regions with positively selected amino acids occurs more often than expected by chance, we used Bedtools [59] Fisher (v2.26.0). All other necessary analyses were done in R [49].
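The overlap test can be illustrated with a per-residue 2x2 contingency table and a one-sided Fisher's exact test. This is a simplified sketch and not the interval-based computation performed by Bedtools; the function names and the choice of single residues as the counting unit are assumptions.

```python
from math import comb


def overlap_table(protein_length, repeat_positions, selected_positions):
    """Build the 2x2 table (a, b, c, d) of residue counts.

    a: in repeat and positively selected   b: in repeat only
    c: positively selected only            d: neither
    """
    rep, sel = set(repeat_positions), set(selected_positions)
    a = len(rep & sel)
    b = len(rep - sel)
    c = len(sel - rep)
    d = protein_length - a - b - c
    return a, b, c, d


def fisher_one_sided(a, b, c, d):
    """P(overlap >= a) under the hypergeometric null for table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, upper + 1))
    return tail / comb(n, col1)


# Hypothetical protein of 10 residues: repeat at 1-4, selected sites at 3, 4, 9
table = overlap_table(10, range(1, 5), [3, 4, 9])
p_value = fisher_one_sided(*table)
```

A small p-value would indicate that positively selected sites fall inside the repeat more often than expected if they were scattered uniformly along the protein.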

RESULTS
Protein repeats occur in a third of the immune genes
Of the 2488 immune genes with well-annotated sequences across multiple species (Supplementary Tables S1 and S2), we found repeats in 823 genes. The Gene Ontology (GO) analysis of molecular functions of repeat-containing genes shows enrichment for molecular binding (integrin binding, transcription factor binding, DNA binding, receptor binding, and kinase binding) (Fig. 1A and Supplementary Table S5). Moreover, repeats are enriched in genes involved in histone methyltransferase complex, transcription regulator complex, and nuclear protein-containing complex (Supplementary Fig. 1A). We noticed that repeats are also enriched in genes involved in organ development and regulatory processes of transcription by RNA polymerases (Supplementary Fig. 1B). To gain an in-depth understanding of the role of immune-specific repeats, we compared the GO enrichment of repeat-containing immune genes with repeat-containing non-immune genes (Supplementary Table S5). Cellular components like histone methyltransferase and nucleoplasm are enriched for immune-specific genes with repeats. We found that of the 722 genes in adaptive immunity (GO:0002250), 141 genes contain repeats. Of the 922 genes involved in innate immunity (GO:0045087), 198 have repeats (Fig. 1B). The proportion of genes with repeats among those unique to innate and adaptive immunity is close to twenty percent each.

Repeats are abundant in specific GC content genes
We observed that most repeats occur at the beginning of the genes (initial 20% of the gene) or at the end (final 10% of the gene). We also observed that the GC content of the genes with repeats ranges between 42-65% (Fig. 2A and Supplementary Table S6). To test whether the GC content differs between repeat-containing genes and genes without repeats, we compared the GC% of both classes for each clade. Our multi-clade dataset shows that the repeat-containing genes have significantly higher GC content than genes without repeats in each clade (Fig. 2B, Wilcoxon test for significance).

Immune genes may contain multiple repeats
Several immune genes consistently have multiple orthologous repeats across clades (Fig. 3A). Some genes have up to five repeats, for example, ARID1B. We quantified the percentage of genes with different numbers of repeats (Supplementary Fig. 2A). More than 80% of the repeat-containing immune genes contain only a single repeat. We also quantified the overlap of positively selected sites with repeat regions using Fisher's exact test and found that 59 immune genes favor positively selected sites in repeat regions (Supplementary Table S5). Of these genes, SETD1A significantly preferred amino acid sites in repeats in three clades, of which Primates is shown in Fig. 3B. We also found that 2223 lineages across the clades contain repeats and are under positive selection (Fig. 3C). For further analyses, we shortlisted the genes with the same orthologous repeat in at least six species in at least three clades. Of the shortlisted genes, we highlighted the repeats in the 3D structure of proteins for DHX9 and SETD1A genes in Homo sapiens (Fig. 3D, E).
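The two quantities compared in this subsection, GC% of a coding sequence and the relative position of a repeat along the gene, can be computed as in the sketch below. The function names are hypothetical, and using the repeat midpoint to assign a position is an assumption; the 20% and 90% cutoffs mirror the "initial 20%" and "final 10%" regions described above.

```python
def gc_percent(cds: str) -> float:
    """GC content as a percentage of unambiguous bases (A, C, G, T)."""
    seq = cds.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return 100.0 * gc / acgt if acgt else 0.0


def relative_position(repeat_start: int, repeat_end: int, gene_length: int) -> float:
    """Midpoint of the repeat as a fraction of gene length (assumed convention)."""
    return (repeat_start + repeat_end) / 2 / gene_length


def position_class(rel_pos: float) -> str:
    """Label a repeat as lying in the initial 20%, the final 10%, or the middle."""
    if rel_pos <= 0.20:
        return "start"
    if rel_pos >= 0.90:
        return "end"
    return "middle"
```

For example, a repeat spanning nucleotides 10 to 30 of a 200-nucleotide coding sequence has a relative position of 0.1 and falls in the "start" class.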

Repeats show positional preference in gene
We classified the homopolymer repeats according to the chemical composition of their side chain into five different categories: nonpolar aliphatic, polar uncharged, positively charged, negatively charged, and nonpolar aromatic. We compared the proportion of repeat occupancy by the gene position and the abundance of each amino acid repeat type (Supplementary Fig. 2B). We noticed that nonpolar aromatic amino acid repeats occur the least, while nonpolar aliphatic repeats are the most abundant. Furthermore, we noticed that most of the repeats appear at the beginning of the gene, but the negatively charged repeats appear less often at the beginning and more often towards the end.
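The five side-chain categories can be expressed as a simple lookup. The assignment of individual amino acids to categories is not listed in the text; the mapping below follows a common textbook grouping and should be read as an assumption, as should the function name.

```python
from collections import Counter

# Assumed textbook-style grouping of the 20 standard amino acids
SIDE_CHAIN_CLASS = {}
for residues, label in [
    ("GAPVLIM", "nonpolar aliphatic"),
    ("FYW", "nonpolar aromatic"),
    ("STCNQ", "polar uncharged"),
    ("KRH", "positively charged"),
    ("DE", "negatively charged"),
]:
    for aa in residues:
        SIDE_CHAIN_CLASS[aa] = label


def classify_homopolymer(segment: str) -> str:
    """Classify a repeat by the side-chain category of its most frequent residue."""
    dominant, _ = Counter(segment.upper()).most_common(1)[0]
    return SIDE_CHAIN_CLASS.get(dominant, "unknown")
```

Under this mapping, a poly-Q stretch is labeled polar uncharged and a poly-P stretch nonpolar aliphatic.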

Repeats are highly variable in length
Orthologous repeats can have species-specific or clade-specific changes in length. We hypothesized that longer genes might tolerate more variations in repeats than shorter ones. To test this, we correlated the gene and orthologous repeat lengths for all the species (Fig. 4A). We observed that repeat length correlates significantly, but weakly, with gene length (Kendall's rank correlation: τ = 0.0308, p < 2.2e-16, n = 40532). We also observed that most genes longer than ten thousand nucleotides have smaller repeats of similar length (e.g., KMT2A, LRP1, LYST, PKHD1L1, and RNF213). Only two genes, PRG4 and HRG, show large length variation for orthologous repeats between different clades (380-2876 nucleotides for PRG4 and 123-867 nucleotides for HRG). We also visualized the repeats of HRG and PRG4 across all the clades (Fig. 4B, C). We noticed that orthologous repeats show length and composition variation between different species. The HRG gene contains only one repeat across all the species considered. In Aves, the gene comprises a repeat of amino acid composition HP (histidine and proline). In contrast, the repeat primarily consists of amino acids GHP (glycine, histidine, and proline) in the rest of the clades. Moreover, the repeat is longer in Aves (237-867 nucleotides long) than in the rest of the clades (123-507 nucleotides long). Similarly, for PRG4, the repeat varies in length and composition across different clades (EPT and PEAT amino acid composition).
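The rank correlation between gene length and repeat length reported above can be illustrated with a small pure-Python implementation of Kendall's tau-b, which handles ties. Which tau variant was used in the study is not stated, so tau-b is an assumption, and this quadratic-time sketch is meant for small examples rather than for the full dataset of 40532 pairs.

```python
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b for two equal-length sequences (O(n^2) pair comparison)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = tied_x = tied_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue              # tied in both variables: ignored
            if dx == 0:
                tied_x += 1           # tied only in x
            elif dy == 0:
                tied_y += 1           # tied only in y
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denom = sqrt((concordant + discordant + tied_x) * (concordant + discordant + tied_y))
    return (concordant - discordant) / denom if denom else 0.0


# Hypothetical gene lengths and repeat lengths (nucleotides)
gene_lengths = [900, 1500, 3000, 12000]
repeat_lengths = [12, 15, 30, 24]
tau = kendall_tau_b(gene_lengths, repeat_lengths)
```

A value of tau near zero with a very small p-value, as reported for the full dataset, reflects a weak monotonic trend that is detectable only because the number of observations is large.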

Repeats show lineage-specific length variation
Since the hypervariability of repeats is a source of rapid, adaptive variations in a short evolutionary time [27,32], we tested for significant length variation of orthologous repeats for each clade between species groups (Table 1). Furthermore, we quantified the number of immune genes for expansion and contraction of repeats for each species within their clade (Fig. 5A). In the Primates clade, out of 7 Hominoidea species considered in the study, six species show more genes under contraction of repeats.
In the Amphibia clade, Gymnophiona (three species considered: Microcaecilia unicolor, Geotrypetes seraphini, and Rhinatrema bivittatum) showed a higher number of genes with contraction of repeats compared to Anura (Xenopus tropicalis, Bufo bufo, Nanorana parkeri, and Rana temporaria) (Supplementary Fig. 3). Repeats are unstable and can vary in length within species of the same clade, potentially giving rise to lineage-specific evolutionary novelty. To test this, we shortlisted the genes which contain orthologous repeats across the clades and calculated the standard deviation of the length (Fig. 5B). CREB1 and RAB35 show no length variation between species of the [22]. Of the genes with significantly contrasting repeat lengths between species groups (from Table 1), we shortlisted FASLG and C1QC as examples to understand the morphological/physiological implication of repeat length variation. The examples were selected based on the availability of protein structure for at least one species in either of the species groups compared and experimental evidence showing the effect of repeat modification. We modeled the protein structure for the other species group using the available template.

FASLG, a transmembrane protein-coding gene, shows lineage-specific length variation of repeat
The FASLG gene, a member of the tumor necrosis factor superfamily, contains a proline repeat. The proline-rich repeat shows a significant change in length in the Rodentia clade, with expansion in Cricetidae (represented by Onychomys torridus, Peromyscus leucopus, Peromyscus maniculatus, Arvicola amphibius, Microtus ochrogaster, Cricetulus griseus, and Mesocricetus auratus) relative to Muridae (represented by Meriones unguiculatus, Mus caroli, Arvicanthis niloticus, Rattus norvegicus, and Rattus rattus) (Fig. 6A). In the transmembrane protein FASLG, the repeat lies on the cytoplasmic side of the protein (Fig. 6B). The protein structure visualization showed the repeat near an α-helix (Fig. 6C). Similarly, within Primates, the C1QC gene shows a significant contraction of its leucine repeat in New World monkeys (12 amino acid long stretch) compared with the rest of the species (17 amino acids) (Supplementary Fig. 4).

DISCUSSION
The rapid evolution of protein repeats is essential in immune genes, as exemplified by leucine-rich repeats in innate and adaptive immunity. This study focuses on the potential significance of repeat abundance, characterization, and length evolution in tetrapod immune genes. Our study found that around one-third of the immune genes contain repeats, which is on par with previous studies of complete gene sets [13,18,60]. We also found that repeats are enriched in functions related to molecular binding and to the differentiation, development, and function of immune cells. Innate immunity primarily works by recognizing molecular structures unique to microorganisms [61] using pattern recognition receptors (PRRs). Each PRR of the host can bind to many molecules of a common structural pattern [62]. PRRs have several distinct classes, of which toll-like receptors (TLRs) are the best characterized. TLRs are transmembrane receptors that recognize bacterial and viral products and activate the downstream immune response against them [63]. Like TLRs, NOD-like receptors (NLRs) are a well-characterized class of cytosolic receptors in immunity. At the C-terminus, all NLRs contain a leucine-rich-repeat (LRR) and a nucleotide-binding oligomerization domain (NOD) [64,65]. The NLRs recognize pathogenic particles and activate downstream pathways of the immune response. T-cell and B-cell receptors play a central role in antigen recognition in adaptive immunity. Processes like somatic hypermutation and recombination, nontemplated nucleotide addition, and gene conversion generate a diverse receptor repertoire to identify almost any antigen [66]. The presence of comparable proportions of repeats in both adaptive and innate immune response genes further emphasizes their ubiquitous role in both pathways.

Fig. 1 Repeat-containing immune genes specialize in specific functions. A GO enrichment of the immune system process genes for molecular functions. B Venn diagram showing the overlap of the number of genes with repeats to immune genes. Adaptive immune response genes (GO:0002250, 722 genes), innate immune response genes (GO:0045087, 922 genes), and immune system process genes with repeats (823 genes) are represented with dodgerblue, green4, and orchid-colored circles, respectively. There are 47 genes common in the immune system process, adaptive immune response and innate immune response genes.

Fig. 2 Repeats are preferred in specific GC% range genes. A The 2D histogram of GC% and the frequency of occurrence of the genes with repeats at their relative positions along the gene. The figure shows more abundance of repeats within the GC% range at the beginning and end of the gene. B The boxplot shows significantly higher GC% in genes having repeats than genes not having repeats across all the clades. The x̄ represents the mean, and n is the total number of sequences used for comparison in that clade.
A relationship between GC content and the occurrence of repeats in mammalian genes is well known. Nucleotide compositional constraints would expedite the formation of repeats in the GC-rich region of the gene [14,67,68]. Previous studies suggest positional preference of repeats in protein sequence, with most repeats occurring at the beginning or end of the gene sequence and avoiding the middle region [69]. For example, leucine (PolyL) repeats occur either at the beginning or at the end of the gene sequence and have a very low frequency in the middle of the gene sequence. In contrast, PolyT frequently occurs in the middle region of the sequence [69]. We also noticed that most repeats in immune genes occur in the 40-65% GC content range, with high frequency at the beginning and end of the gene sequence and very low frequency in the middle region. The positional preference could result from avoiding disorder in domain folds of the protein structure [70,71]. Genes with repeats have significantly higher GC% than those without repeats across all the tetrapod clades studied. This result suggests that GC-rich genes are more likely to acquire a repeat. As GC-biased codons mainly encode repeats, they increase the overall GC content of the gene even further [68].
Many proteins lack a well-defined 3D structure and possess their functional unit in intrinsically disordered or unstructured regions [72][73][74][75]. The unstructured state facilitates increased interaction capacity, protein interaction networks, enhanced association rates, and adaptability to diverse partners [75]. These disordered regions or proteins are often generated by the variable length of internal repeat regions [76] and can potentially give rise to novel functions or protein variants [16]. Repeat regions play a crucial role in protein-protein interaction networks, especially in hub proteins (highly connected proteins), by providing flexibility, adaptability, and an extended area for interaction [75]. More repeats in an amino acid sequence facilitate more scope for functional diversification and the generation of new interaction networks in a short evolutionary time. Variable repeat regions provide genetic variation and an opportunity for selection to work on it [77]. The occurrence of positively selected sites in repeat regions suggests that repeats are malleable regions that provide functional diversity and possess functionally important sites for interaction with other proteins. Overlap of only around 10 percent between lineages under positive selection and those having repeat regions indicates that repeat regions help provide only length variability and may not lead to positive selection.
Longer genes may tend to have longer repeats and, hence, have more nucleotide length variation in orthologous repeats. This factor can influence the result of significant repeat length variation and possibly bias the overall results towards longer genes with repeats. We noticed that longer repeats and longer genes have a positive but weak correlation and that most longer orthologous repeats have similar lengths (e.g., RNF213, LRP1, KMT2A, LYST, and PKHD1L1). We observed the most length variation of orthologous repeats in PRG4 and HRG genes, possibly resulting from annotation/assembly artifacts.

Fig. 5 Global patterns of repeat-length variability across clades. A The pie chart represents the frequency of expanded and contracted genes in the Primates clade. Red color represents the proportion of genes with expanded repeat lengths, and blue is the proportion of genes with contraction of repeat length. The brackets represent the number of significantly expanded and contracted genes for each species. B The heatmap of the standard deviation of the orthologous repeat lengths in genes across different clades. White-colored boxes represent the absence of the gene for that clade, the green color represents that all repeat lengths are the same in that clade, i.e., the standard deviation is zero, and the red represents a high standard deviation.

Histidine-rich glycoprotein, HRG, is a plasma glycoprotein that can regulate immune complex clearance, complement activation, phagocytosis of apoptotic cells, cell adhesion and migration, and modulate angiogenesis [78]. Human HRG contains two cystatin-like regions at the N-terminus, a histidine-rich region (HRR), a proline-rich region (PRR), and a C-terminal domain [79]. In conditions similar to tissue injury or tumor growth, like low pH or interaction with Zn2+, the histidine-rich region of the molecule enhances ligand binding [78,80]. We noticed variation in the composition of the repeat region of HRG among different clades (HP and GHP amino acid composition) with varying lengths (with the longest repeat observed in Aves). Variation in size and composition of the HRG repeat across species reflects different evolutionary trajectories under varying selective constraints on the biological process of the gene.
As discussed earlier, variation in repeat length can give rise to functional and evolutionary novelty in a short time. Repeats undergo accelerated expansion once length variation is above a critical length, while minor repeat length variation is proposed to be functionally neutral [22]. Purifying selection constrains the repeat length to a specific range above which the length variation is deleterious [12,22]. We wanted to quantify the abundance of contraction or expansion of the number of repeat-containing immune genes in a species-specific manner within each clade. A bias towards either expansion or contraction will indicate a change in an overall gene-interaction network due to repeat length variation. We observed Hominoidea to have more genes under the contraction of repeats. Apes (Hominoidea), compared to Asian and African monkeys, exhibit increased sensitivity to immune stimulation by viruses and bacteria [81]. Apes also activate a broader array of defense molecules potentially beneficial for early pathogen killing [81]. Widespread repeat contraction of immune genes in apes might have contributed to the different immune responses between apes and monkeys. Similarly, in the Amphibia clade, Gymnophiona shows more genes under significant contraction than the Anura clade. The number of species included in these two clades of Amphibia is limited, and more data are needed to test the consistency of the result.
In the case of strong purifying selection, the orthologous repeats will tend to maintain the same composition and length across distinct species [22,82]. Of the genes with the same orthologous repeats across clades, we randomly selected 11 genes to look for the standard deviation of repeat length. RUNX2, KAT6A, PRRC2C, and HAND2 show length variation between species, of which RUNX2 is a well-known example of morphological variation in vertebrates due to repeat length variation [31,32]. Notably, KAT6A also showed repeat length variation between species across clades. KAT6A helps regulate transcription and gene expression by acetylating the lysine-9 residue of histone 3. KAT6A is also known to participate with RUNX2 in osteogenesis [83,84]. Correlation and association of repeat length variation of RUNX2 and KAT6A can be a potential target for future studies of morphological variations.
Clade-specific expansion and contraction of a repeat in protein sequence can reflect the overall functional diversification of that protein's interaction network or structure. In the Rodentia clade, we observed expansion of the proline-rich repeat of FASLG in Cricetidae compared to Muridae. FASLG is a transmembrane protein whose binding with its receptor induces apoptosis [85,86]. FASLG also participates in neural degeneration/regeneration [87] and T cell activation [88]. The cytoplasmic side of FASLG contains a distinct proline-rich region consisting of several SH3 binding domains flanked by arginine and lysine residues and two conserved casein kinase I (CKI) binding sites [88,89]. The proline-rich repeat is responsible for sorting Fas ligand (FasL) to secretory lysosomes [89], influencing T cell activation by reverse signaling [88]. Deleting the proline-rich domain in cells with secretory lysosomes leads to surface expression of FasL [89]. Costimulation and signaling through a T-cell receptor are necessary for optimal T-cell activation. Deleting 45-54 amino acids of the repeat leads to failure of costimulation [88]. The contraction/expansion of the proline-rich repeat in two clades of Rodentia can potentially indicate an alteration in the interaction network associated with the function of the repeat.
Our in-silico analysis identifies several promising candidate genes whose repeat length has changed drastically between species or clades, rapidly generating evolutionary novelty. However, the practical implications of these changes are unclear. Hence, the immediate next step would be to investigate the consequences of these changes on the structure and function of these proteins. Changes in the repeat regions of genes can change interaction partners, structural conformation, the prevalence of post-translational modifications, protein stability, etc. Moreover, genes that contain more than one repeat have multiple substrates for rapid change in the protein sequence. Protein repeats in living organisms have survived selective constraints and persist due to beneficial or neutral functional consequences. Based on these surviving repeats, we identify malleable regions of immune genes. The availability of high-throughput genome editing technologies provides a new avenue to synthesize genes with protein repeats of various lengths and compositions [90,91]. Such synthetic biology approaches will be able to provide a comprehensive picture or fitness landscape of which regions of what genes can tolerate repeats and how such repeats affect their functions.
This study focused on clade-specific orthologous repeat length variation in a phylogenetically corrected framework in immune genes. We characterized and quantified the different types of repeats and their overlap with positively selected lineages and sites. Moreover, we enhance previously reported morphological/physiological consequences of repeat length variation across vertebrates and provide new candidates for future studies. Our study of the repeat region in immune genes highlights the importance of the repeat-rich domain in immunological processes. It suggests repeats as a potent source of the immense diversity of immune functions with a limited number of genes.

Limitations
A conservative list of ~1500 genes is involved in immune response [92]. After including genes with other immune-related functions, a more comprehensive gene set (~2500 genes) can be considered immune genes. However, our current understanding of immune-related genes is far from complete and continues to improve with the availability of high-throughput assays [93,94]. Moreover, our analysis strategy requires removing species-specific genes and those with poor annotation and genome assembly challenges. Hence, our list of immune genes is not exhaustive and may have missed several genes with protein repeats that are difficult to assemble.
Similarly, the repeat sequence may be incorrect due to assembly errors and can confound the identification of repeat expansion and contraction. We consider 12 vertebrate clades for our analysis (see Supplementary Table S2) to have an extensive phylogenetic sampling. However, five of these clades (Afrotheria, Amphibia, Marsupials, Perissodactyla, and Testudines) have fewer than ten species with genome assemblies. Although the Aves clade has 56 species with genomes, the quality of these genomes is highly heterogeneous, and many genes are missing from the genome assembly. Hence, sparse sampling and missing data will likely limit our understanding of protein repeat evolution. Accurate, high-quality genome assemblies and availability of population-level genomic samples will provide a more detailed picture of protein repeat evolution. We use a phylogenetic correction approach that cannot distinguish between empirically significant and biologically relevant repeat changes. For instance, we do not consider within-gene position-specific heterogeneity, amino-acid type, sequence saturation, repeat purity, etc., while shortlisting candidate genes. Overall, our criteria are stringent and may result in the exclusion of important genes.

DATA AVAILABILITY
All data associated with this study are available in the Supplementary Materials, and data and scripts used for analysis are provided in an easy-to-browse format: https://github.com/ceglablokdeep/Immune_genes_repeats. A copy of the data has been uploaded on Mendeley datasets with https://doi.org/10.17632/8zxtjh8fjs.1.