Identification of lineage-specific gene family expansions in a database of gene families

Gene families specific to, or with significantly changed membership in, particular lineages compared to outgroups may reflect important lineage-specific changes in biology. Here we describe a computational protocol to identify gene families that vary greatly in gene count across a species tree. This protocol uses three different metrics to capture aspects of this variability, and calculates them for each family in an in-house database of gene families (e.g. built using the Ensembl Compara pipeline). One metric ( C v ) identifies families that vary a lot in gene count across the species tree, and the other two ( E max , Z max ) identify families that have an elevated gene count in a certain clade of the species tree. Our protocol controls for differences in gene counts due to fragmented assemblies.


Introduction
Gene families specific to, or with significantly changed membership in, particular lineages compared to outgroups may reflect important lineage-specific changes in biology. Here we describe a computational protocol to identify gene families that vary greatly in gene count across a species tree.
This protocol can be used to analyse families from an in-house database of gene families (e.g. built using the Ensembl Compara pipeline, Vilella et al 2009).
To identify gene families that vary greatly in gene count across a species tree, we use three metrics to capture aspects of this variability. However, to control for differences in gene counts due to fragmented assemblies (e.g. genes split across multiple contigs), we use summed protein length per species in a family as a proxy for gene counts in these metrics. The first metric, coefficient of variation (C v ), simply measures the variability in summed protein length (per species) in a family. The other two metrics, Z-score (Z max ) and enrichment coefficient (E max ), reflect whether there is a gene family expansion in a particular group of species, relative to the rest of the species tree. For this, species are classified into a set of non-overlapping species groups of interest that ideally correspond to monophyletic clades of the species tree (e.g. nematodes, flatworms, and arthropods in a species tree of animals).
The three metrics are defined as follows: where the numerator (s) is the standard deviation in the summed protein length per species in the family, and the denominator is the mean of the summed protein length per species in the family.
(ii) maximum Z-score, where T is the set of non-overlapping species groups, c is a species group in the set T. For each species group c in the set T, we calculate Z as the difference between the mean summed protein length (per species) in that species group c and the overall mean (the numerator), divided by the standard deviation in summed protein length per species among the species outside group c (the denominator). Z max is the maximum of these Z values over all tested species groups. Note that the standard deviation used here is for the species outside species group c, so that it is not affected by any gene family expansion that occurred in the family within the species group c. That is, for each species group in the set T, we calculate E as the ratio between the mean of the summed protein length (per species) in that species group and the mean summed protein length outside that species group. E max is the maximum of these E values over all tested species groups.
High values of the three metrics may be used to detect families with different patterns of variability.
High values of Z max and E max may indicate a gene family expansion in a particular clade of the species tree, but high values of C v may highlight cases that they miss, such as a gene family that has independently expanded in different clades of the species tree. In addition, a gene family that has expanded in a particular clade of the species tree can only have high values of Z max and E max if the gene family also includes genes from species outside this clade, whereas high values of C v can be used to detect large variability within a single clade even if the family lacks members from outside that clade.
The protocol has two steps. In Step A, the set of all input gene families is filtered to remove likely transposable element families, assuming that these are not of interest to most users, since expansions of transposable element gene families are relatively unlikely to contribute to lineagespecific changes in biology.

In
Step B, we define a set of non-overlapping species groups, which are ideally monophyletic clades of the species tree (e.g. nematodes, flatworms), in which we wish to identify lineage-specific expansions (e.g. nematode-specific expansions, flatworm-specific expansions). In Step B, we also define a subset of the species as having high-quality assemblies. Then, the three metrics C v , E max and Z max are calculated for each family in an in-house database of gene families. Families with high values of each metric are good candidates for families with lineage-specific expansions, and can be manually examined in follow-up analyses.
Our protocol controls for differences in gene counts due to fragmented assemblies, firstly by using summed protein length per species in a family as a proxy for gene count, and also by calculating the three metrics based only on those species that have high-quality assemblies (as defined in Step B).
A related approach for finding lineage-specific gene family expansions is to identify Pfam (Finn et al 2014) domains that are enriched in a certain species group compared to the rest of the species tree, and then to identify the expanded gene families containing that domain (e.g. see Tsai et al 2013). In contrast, our current protocol does not require that the genes in an expanded family have any predicted protein domains, so can detect expansions in uncharacterised or relatively poorly characterised gene families.
A more similar approach to the current protocol was used to identify Spirometra lineage-specific expansions by Bennett et al 2014, by ranking gene families by the ratio of the total cumulative length of encoded Spirometra erinaceieuropaei genes in a family to the cumulative length of the corresponding Echinococcus multilocularis genes in the same family. This is a very similar idea to E max , but E max for a family is extended to be a ratio between the mean summed protein length in a species group, compared to the mean summed protein length outside that species group. This should be a more accurate metric for detecting expansions because it can include information from more 5 species (e.g. several Spirometra-lineage species and several Echinococcus-lineage species).

Reagents
Ensembl Compara pipeline software Pfam InterProScan Equipment Computer cluster.

Procedure
Step A: filtering gene sets and gene families for transposable element genes The input to Step A is a gene prediction set (and the corresponding predicted protein sequences) for each species of interest; and a database of gene families containing all the genes from all those species, e.g. built using the Compara pipeline (Vilella et al 2009).

1.
Interproscan 5  The output from Step A is the set of gene families from which likely transposable element gene 6 families have been removed.
Step B: calculating metrics for identifying highly variable gene families The input to Step B is the set of non-transposable element gene families output from Step A above, and a species tree for the full set of species.
1. The set of all species is classified into a set of non-overlapping subsets. We refer to these subsets as 'species groups', and they are the subsets in which we wish to identify gene family 2.
If the set of all species includes some species with relatively high-quality assemblies and some species with much more fragmented assemblies, then a subset of species that have high-quality assemblies, and span the major clades of the species tree, are designated 'tier 1' species. The tier 1 species should ideally be those with relatively contiguous assemblies (e.g. N50/scaffoldcount > 5) and relatively complete proteomes (e.g. CEGMA (Parra et al 2007) partial score > 85%).

3.
The metrics C v , Z max and E max are calculated for each of the gene families. To increase the reliability of these metrics, only the tier 1 species are considered, since these species have the highest quality assemblies, and therefore most complete proteomes and fewest artefactual gene splits and merges. When calculating the means and standard deviations that contribute to these metrics, the 'tier 1' species that lack any genes (in a family) are taken into account, as follows. First, to calculate the mean and standard deviation for a whole family (e.g. for C v ), we identify the node of the species tree corresponding to the root of the gene tree for the family.
Then, for each tier 1 descendent species that is present on the species tree but not in the current gene family, we count its summed protein length as zero when calculating the mean and standard deviation. Similarly, when calculating the means and standard deviations for a particular clade (for Z max or E max ), the node of the species tree corresponding to the root of the clade is identified, and all tier 1 species that descend from that node of the species tree are taken into account.
The code for calculating metrics is tied to our in-house infrastructure, but is available for perusal at tinyurl.com/comparaFamiliesAnalysis-py The output from Step B is the C v , Z max and E max values for each of the families.

Troubleshooting
Because at least two genes are needed to calculate the C v , Z max and E max metrics for a family, and only tier 1 species are considered when calculating them, these metrics are undefined for families containing less than two genes from tier 1 species. Furthermore, in very small families, the estimates of C v , Z max and E max are very noisy (because the underlying estimates of mean and standard deviation are noisy), so we recommend discarding families having few (e.g. <10 genes) from tier 1 species.
It is possible that some of the families with relatively high values of C v , Z max or E max may turn out to be undetected transposable element genes; or families that have gene counts that are equally as high (or nearly as high) in outgroup species as in the species group of interest. Expansions of transposable element gene families, or expansions specific to outgroup species, are unlikely to be interesting to the user, so we suggest that users manually examine the families with high values of C v , Z max or E max to discover and possibly discard such cases. In Step B, species are classified into a set of non-overlapping species groups of interest in which we want to find lineage-specific expansions (e.g. nematodes, flatworms). These ideally correspond to monophyletic clades of the species tree. In order to detect lineage-specific expansions that occurred in narrower phylogenetic groups, the user may want to run the protocol again with narrower species group definitions (e.g. nematode clades I, II, III, IV, V, and cestode flatworms, trematode flatworms).

Anticipated Results
The output from the protocol is the C v , Z max and E max values for each of the families. In order to filter the list of families to identify those families that have the most striking patterns of variability, we suggest taking the union of the top 500 highest-scoring families according each of the metrics (C v , 8 Z max and E max ), and then plotting phylogenetic trees for each of these families, and examining the plots by eye. The cutoff of 500 suggested here is arbitrary and users should investigate the effect of varying this cutoff. Definition of Emax