A site-and-branch-heterogeneous model on an expanded dataset favor mitochondria as sister to known Alphaproteobacteria

,


Introduction
Mitochondria stem from an ancient endosymbiosis that occurred during the origin of eukaryotic cells 1 .As a result, all extant eukaryotes have mitochondria or evolved from mitochondrion-bearing ancestors [1][2][3] .Some hypotheses have it that mitochondria provided excess energy required for the origin of eukaryotic complexity 4 , whereas others suggest that mitochondrial symbiosis brought efficient aerobic respiration into a more complex proto-eukaryote 5 .The nucleocytoplasm of eukaryotes is now known to be most closely related to Asgard archaea [6][7][8] .Mitochondria, on the other hand, have been known for decades to be phylogenetically associated with the Alphaproteobacteria 9,10,1 .However, the precise relationship between mitochondria and the Alphaproteobacteria, or any of its sub-groups, has been elusive and remains a matter of intense debate (e.g., see 11,12 ).Settling this debate will provide insights into the nature of the mitochondrial ancestor and the ecological setting of its endosymbiosis with the host cell 1 .
These alternative proposals suggested that mitochondria may have evolved from either streamlined or metabolically versatile free-living alphaproteobacteria [22][23][24][25] .Most recently, the phylogenetic placement of mitochondria has been vividly debated 11,12 .One study found mitochondria as a sister group to the entire Alphaproteobacteria (i.e., the Alphaproteobacteria-sister hypothesis) 11 .This conclusion was supported by the inclusion of novel alphaproteobacterial metagenome-assembled genomes (MAGs) from worldwide oceans, and by decreasing compositional heterogeneity through site removal.However, a subsequent study argued that removing compositionally heterogeneous sites from alignments might lead to the loss of true historical signal 26,12 .The authors of the latter study, instead, used a taxon-removal and -replacement approach, and concluded that mitochondria branch within the Alphaprotoebacteria as sister to the Rickettsiales and some environmental metagenome-assembled genomes (MAGs) 12 .
There are several reasons why it is difficult to confidently place mitochondria among their alphaproteobacterial relatives.First, the evolutionary divergence between mitochondria and their closest bacterial relatives is estimated to have occurred >1.5 billion years ago 27,28 .This has erased the historical signal (e.g., through multiple amino acid replacements) that was originally present in the few genes that mitochondria and alphaproteobacteria still share.Second, the Alphaproteobacteria is under sampled and most of its diversity remains to be discovered, as suggested by recent metagenomic surveys 11 .Third, and perhaps most problematic, the genomes of some lineages in the Alphaproteobacteria and those of mitochondria have undergone convergent evolution.For example, the Rickettsiales and Holosporaceae (intracellular bacteria) 29 , or the Pelagibacterales and 'Puniceispirillaceae' (planktonic bacteria) 30 , have reduced or streamlined genomes with compositionally biased genes similar to those of mitochondria.The genes and genomes of these taxa are biased towards A+T nucleotides (and their proteins towards F, I, M, N, K, and Y amino acids) in contrast to other groups that have not evolved reductively (which might be biased towards G+C nucleotides and G, A, R, and P amino acids) 29 .This sort of compositional heterogeneity is often the cause of artefactual attractions among lineages with similar compositional biases in phylogenetic inference 31 .
To cope with the aforementioned sources of phylogenetic errors, we developed and implemented a new phylogenetic model of protein evolution that accounts for compositional heterogeneity across both alignment sites and tree branches.Moreover, we also gathered an expanded set of 108 proteins of alphaproteobacterial origin in eukaryotes (in comparison to <67 previously available) and assembled more than 150 non-marine alphaproteobacterial MAGs from microbial mat, microbialite, and lake sediment metagenomes.We combined these improvements to explore and dissect the phylogenetic signal for the origin of mitochondria present in both modern eukaryotes and alphaproteobacteria.

Results
To date, most studies that aimed to phylogenetically place the mitochondrial lineage have relied exclusively on mitochondrion-encoded protein datasets that range from 12 to 38 proteins [16][17][18]32,11,12 . These makers are not only few (e.g., 24 genes and 6,649 sites in 11 ) but tend to be compositionally biased because most mitochondrial genomes are rich in A+T.The only set of nucleus-encoded proteins of mitochondrial origin published thus far comprises 29 proteins 19,20 .
To expand the number of proteins for placing the mitochondrial lineage, we systematically surveyed both nuclear and mitochondrial proteomes.After a multi-step phylogenetic screening, we identified 108 marker proteins of alphaproteobacterial origin in eukaryotes.Of these, 64 are exclusively nucleus-encoded, 27 are both nucleus-and mitochondrion-encoded, and 17 are exclusively mitochondrion-encoded proteins (Fig. 1A, Fig. S1).Our expanded dataset comprises most marker proteins previously identified 11,19,20 and adds 56 new ones (Fig. S1).Functional annotations show that these proteins have diverse functions within mitochondria (Fig. 1B, Table S1).Most are involved in energy metabolism (e.g., respiratory chain complex subunits) and protein synthesis (e.g., ribosomal subunits) (Fig. 1B, Table S1).The fact that all these proteins have mitochondrial functions strengthens the view that the genes that encode them were transferred from (proto-)mitochondria to nuclear genomes and are therefore not secondary lateral transfers to eukaryotes.The new nucleus-encoded proteins also tend to have much less variable and biased amino acid compositions in comparison to those which are mitochondrion-encoded and some that are both nucleus-and mitochondrion-encoded (Fig. 1A).Similarly, nucleus-encoded proteins also have a broader range of G A R P/F I M N K Y amino acid ratios of 0.70-1.95,whereas mitochondrion-encoded proteins have a range of 0.25-0.77which suggests that they are much more compositionally biased towards F I M N K Y amino acids (and their genes towards A+T).The expanded set of nucleus-encoded genes are expected to increase phylogenetic signal by virtue of increasing the amount of data, and also introduce potentially less compositionally biased sequences that could otherwise cause phylogenetic artefacts.
Most studies have exclusively relied on genomes of cultured alphaproteobacteria (e.g., [18][19][20]32 ). Onl one recent study incorporated novel alphaproteobacterial MAGs from metagenomes sequenced by the Tara Oceans project 11 .So far, all of these alphaproteobacterial MAGs came from oceanic open waters and tend to be small and A+T-rich 11 .Moreover, none of them appeared to be most closely related to mitochondria to the exclusion of other alphaproteobacteria 11 .
Most of the new MAGs, which are widely distributed across the Alphaproteobacteria tree, also appear to encode an almost-complete set of bacteriochlorophyll biosynthesis enzymes which suggest that they come from photosynthesizers in the diverse environments sampled (e.g., microbial mats; Fig. 1D, Table S3).L: Ribosome translational factor; M: rRNA modification/maturation; N: tRNA modification/maturation; O: Aminoacyl-tRNA synthetase; P: RNA polymerase; Q: Branched-chain amino acid/fatty acid metabolism, R: Pyrimidine biosynthesis; S: Ubiquinone biosynthesis; T: Protein import/export; U: Iron-sulfur cluster biogenesis; V: Clp protease complex subunit; W: Proteasome-like complex subunit; X: Mitochondrial division (see also Table S1).(C) Phylogenetic tree of 154 novel MAGs reported here, the 45 MAGs reported by Martijn et al. (2018), and 1,188 of maximally diverse alphaproteobacterial genomes in GTDB r89 database.Taxon sample reduction was done with Treemmer 41 and phylogenetic inference with IQ-TREE (-fast mode) and the LG4X model (120 GTDB-Tk marker genes; 14,048 amino acid sites).(D).
Phylogenetic tree for the 154 alphaproteobacterial MAGs reconstructed from diverse metagenomes sequenced in this study and summary of major features for each MAG.Tree was inferred with IQ-TREE (fast mode) and the LG4X model after having removed 50% of most compositionally heterogeneous ɀ sites (120 GTDB-Tk marker genes; 7,024 amino acid sites) (see also Table S3).
To address recent controversies 11,26,12 , we first assembled the largest dataset to date that includes a new set of 64 nucleus-encoded and 44 mitochondrion-encoded proteins (108 genes in total and 33,704 amino acid sites; see above).Our dataset also comprised a wide taxon sampling with twelve mitochondria from diverse eukaryotes (from most 'supergroups'), and a broad set of 104 alphaproteobacteria that covered all major known lineages and maximized phylogenetic diversity (subsampled from a set of more than 3,300 genomes; see Methods).Importantly, our dataset incorporated several Rickettsiales species that have short branches and are less compositionally biased (Fig. 1D, Fig. S2, Table S4), as well as novel representatives of the MarineProteo1 clade (Fig. 1D, Fig. 2A, Table S4).Instead of relying on Beta-, and Gammaproteobacteria as outgroups (as in 11,12 ), we used the much closer Magnetococcia which has been consistently found to be sister to all other alphaproteobacteria (e.g., 11,12,20 ).This was done to decrease potential artefactual attractions between the long mitochondrial branch and distant outgroups, a concern raised before 11,26,12 .Furthermore, we also removed sites estimated to have undergone functional divergence at the origin of mitochondria (these represented only 5.2% of all sites) using the FunDi mixture model 42 .This was done to reduce potential artefacts from model misspecification as no phylogenetic model currently available adequately captures such patterns of functional divergence in proteins.
We first analyzed our dataset using the MAM60 site-heterogeneous model that was specifically inferred from our own dataset-this model has been shown to have a better fit than generic site-heterogenous models (e.g., C10-60) 43 .Analyses on the untreated dataset (i.e., without compositionally heterogeneous sites removed) placed mitochondria as sister to all of the Alphaproteobacteria with maximum support, i.e., both the monophyly of the Alphaproteobacteria and the Alphaproteobacteria-mitochondria clade were fully supported (Fig. 2A).However, these analyses also recovered the grouping between the Pelagibacterales, Holosporaceae, and other long-branching species (Fig. 2C, Mendeley Data) that, in previous work 29 , were shown to artefactually attract each other because of similar amino acid compositional biases.A common strategy for dealing with compositional heterogeneity in the absence of site-and-branch-heterogeneous models is to remove alignment sites based on metrics that quantify their compositional heterogeneity 11,12,29 .The progressive removal of the compositionally most heterogeneous sites according to the ɀ and χ 2 metrics 11,29,44 disrupted compositional attractions and showed clear support for the Alphaproteobacteria-sister hypothesis (Fig. 2B, Fig. 2C).
Because nucleus-encoded and mitochondrion-encoded proteins display different amino acid compositional patterns (Fig. 1A), we also analyzed these two protein sets separately.Whereas nucleusencoded proteins unambiguously supported the Alphaproteobacteria-sister hypothesis across all analyses (Mendeley Data), the mitochondrion-encoded proteins showed decreased support for this hypothesis as compositionally heterogeneous sites are removed (Fig. S3, Mendeley Data).However, no alternative hypothesis was favored and any placement of mitochondria among the Alphaproteobacteria was unsupported for mitochondrion-encoded proteins (Fig. S3; Mendeley Data).This suggests that mitochondrion-encoded proteins may have a more equivocal phylogenetic signal.Unlike in many previous studies 19,20,12,11 , we did not find support for the Rickettsiales-sister hypothesis in any of our analyses (Mendeley Data).S5).The taxonomic labels follow the higher-level taxonomy outlined in 29 .
Thickened branches represent branch support values of >90% SH-aLRT and >90% UFBoot2+NNI.(C) Variation in support values for the placement of mitochondria outside of the Alphaproteobacteria (SH-aLRT and UFBoot2+NNI) throughout the progressive removal of compositionally heterogenous sites according to the ɀ and χ 2 metrics.Support for the branch that groups mitochondria with all alphaproteobacteria (but excludes MarineProteo1 and the Magnetococcia) is always maximum (i.e., 100% SH-aLRT /100% UFBoot2+NNI; Mendeley Data).(D) Heatmap table summarizing the differences in log-likelihoods (lnL) relative to the highest log-likelihood for several alterative placements of mitochondria (A1-14 and B1-B12 in (A) and (B); see Table S6 and Fig. S4 for all tree topologies) under a conventional site-heterogeneous model (MAM60) and our new site-and-branch-heterogeneous model (MAM60+GFmix).Models (rows) are arranged in increasing order (from top to bottom) according to lnL values.For each model (row), tree topologies (columns) are arranged in increasing order (from left to right) according to lnL values.Absolute log-likelihood values for each tree (A-T) under the different models tested are reported within parentheses.For all four models, all topologies other than the maximum-likelihood tree were rejected with p-values of < 0.0001 according to Bonferroni-corrected χ 2 tests.See Table S6 for all tree topologies and datasets tested.
All studies to date have exclusively relied on either site-homogenous or purely site-heterogeneous models (e.g., CAT in PhyloBayes or C60 in IQ-TREE) 11,12,[14][15][16][17][18][19][20]22,23,32 . Indeed, o tractable model that accounts for compositional heterogeneity across branches and sites simultaneously is available; current branch-heterogeneous models cannot be combined with site-heterogenous models 31 , or are too computationally intensive and suffer from convergence problems 45,46 .To overcome these shortcomings, we developed a model that captures the most important compositional heterogeneity in alphaproteobacterial genomes-namely the variation in the G A R P/F I M N K Y amino acid ratio that is driven by variation in G+C vs. A+T nucleotide content (see 29 ).Our new branch-heterogeneous model, GFmix, models the variation in the ratio of G A R P/F I M N K Y amino acid frequencies across the phylogenetic tree in combination with conventional site-heterogeneous models (e.g., C10-60, MAM and UDM models).Briefly, this model requires a rooted tree, and introduces a new parameter that represents the G A R P/F I M N K Y ratio for every branch in a tree that is based on the amino acid compositions of all taxa that descend from that branch (see Materials and Methods for details).These parameters, in turn, adjust the frequencies of each site class in the site-profile mixture model resulting in a new transition rate matrix, Qc, for each mixture class for the given branch.We developed and implemented the new GFmix model in a maximum likelihood framework.
To further test the phylogenetic placement of mitochondria, we used the MAM60+GFmix model to estimate log-likelihoods on two sets of fixed trees (Fig. 2A, Fig. 2B, Fig. S4).The first tree set was inferred from the untreated dataset (108 genes, 33,704 sites), whereas the second tree set was inferred from a compositionally homogenized dataset through site removal (108 genes, 16,029 sites); the latter dataset minimized the differences of G A R P/F I M N K Y amino acid ratios among taxa (Table S5).(Both tree sets were inferred using the MAM60 site-heterogeneous model; see above.)We then varied the position of mitochondria along all backbone branches on each fixed tree (Fig. 2A, Fig. 2B, Fig. S4).Furthermore, we also grouped proteins into partitions according to distances calculated based on their G A R P/F I M N K Y compositional disparity (Fig. S5).Our analyses show that likelihoods estimated under the MAM60+GFmix model improved significantly when compared to conventional site-heterogeneous models (Fig. 2D, Table S6, likelihood ratio test (LRT) p-value = 0); model fit was improved even more when the proteins were grouped into ten separate partitions according to G A R P/F I M N K Y compositional disparity (Fig. 2D, Table S6, LRT p-value = 0).Importantly, the partitioned MAM60+GFmix model clearly favours trees that display the Alphaproteobacteria-sister relationship and where the grouping of longbranching and compositionally biased taxa (e.g., Pelagibacterales, Holosporaceae) is disrupted (i.e., those trees recovered from compositionally homogenized datasets through ɀ site removal; Fig. 2D, Table S6).This suggests that the removal of ɀ sites effectively decreases overall compositional heterogeneity and potential artefacts.
The top three trees often favored by the MAM60+GFmix model (i.e., those with the highest likelihoods) have mitochondria in adjacent branches: Alphaproteobacteria-sister (trees A11 and B9 in Fig. 2A and Fig. 2B), Rickettsiales-sister (trees A5 and B4 in Fig. 2A and Fig. 2B), and mitochondria as sister to all alphaproteobacteria except the Rickettsiales (or Caulobacteridae-sister; trees A10 and B8 in Fig. 2A and Fig. 2B) 29,47 .However, Bonferroni-corrected χ 2 topology tests show that the optimal trees that display the Alphaproteobacteria-sister relationship are significantly better than all trees with other positions for mitochondria (see Fig. 2D).Even though the Alphaproteobacteria-sister relationship is also favored by the MAM60+GFmix model for the mitochondrion-encoded protein dataset, the Caulobacteridae-sister relationship cannot be rejected by the Bonferroni-corrected χ 2 tests (i.e., p-values > 0.05; Table S6).This further supports the notion that the phylogenetic signal for the placement of mitochondria is weaker in mitochondrion-encoded proteins (see above).The Rickettsiales-sister relationship is rejected for all datasets and models (p-value < 0.005; Table S6).Overall, most of our distinct phylogenetic approaches show support for the Alphaproteobacteria-sister hypothesis.

Discussion
We have found significant support for the Alphaproteobacteria-sister hypothesis that has the mitochondrial lineage as the closest sister to all currently sampled alphaproteobacteria.Our findings thus conflict with the recent suggestion that mitochondria may branch within the Alphaproteobacteria as sister to the Rickettsiales 12 .Indeed, we believe that the design of the study by Fan et al., (2020) was particularly prone to artefacts.In an effort to choose less compositionally biased (i.e., G+C-rich) species for mitochondria and the Rickettsiales, these authors inadvertently selected species that are more divergent than most members of their respective groups.For example, the inclusion of mitochondria of flowering plants led to a considerably long stem branch for the mitochondrial lineage (see their Fig.S31-48).
Similarly, Anaplasma, Neorickettsia, and Wolbachia (Anaplasmataceae) are among the longest branches in the Rickettsiales (see their Fig.S50; see also our Fig.S2).All these species are secondarily, and not ancestrally, less compositionally biased, i.e., they evolved from species with A+T-rich genomes.
Moreover, their analyses were based on a rather small dataset that comprised only 18 or 24 mitochondrion-encoded genes (5,583 and 6,643 sites, respectively) and fewer than 41 taxa.These factors may, in combination, have led to the inference of poorly supported trees (e.g., see their Figs.S31-40), and an artefactual attraction between mitochondria, the Rickettsiales, and the FEMAG I and II groups (i.e., Fast-Evolving MAG I and II; see their Fig. 4).
Several previous studies have suggested that mitochondria were either sister to the Rickettsiales [18][19][20] or phylogenetically embedded in a larger group comprised of both the Rickettsiales and the Holosporaceae 20 .These hypotheses implied that the mitochondrial ancestor may have been an intracellular parasite: throughout its early evolution, the ancestor of mitochondria changed its function from an energy parasite to an ATP-producing respiratory organelle [18][19][20][21] .The finding that mitochondria are no longer phylogenetically associated to the Rickettsiales and are instead sister to the entire Alphaproteobacteria clade makes a parasitic origin of mitochondria less plausible.However, the nature of the mitochondrial ancestor remains poorly constrained.Future studies on species of the MarineProteo1 clade might shed some light on the early evolution of the Alphaproteobacteria, and possibly also on the mitochondrial ancestor.However, we note that the MarineProteo1 clade is separated by a long branch from the Alphaproteobacteria and mitochondria.Currently available genomes for the MarineProteo1 clade are relatively small, but not necessarily compositionally biased, and suggest that these alphaproteobacteria might be reduced and physiologically specialized (Fig. S2, Table S4).
Unravelling the deep evolutionary history of mitochondria is an inherently hard phylogenetic problem.One of the main challenges is to properly account for the drastically different compositional biases across anciently diversified lineages 29 .Here, we have moved towards overcoming this major obstacle.Our newly developed and implemented site-and-branch-heterogenous model allowed us, for the first time, to test different phylogenetic placements for mitochondria relative to the Alphaproteobacteria while accounting for the drastic amino acid compositional changes that alphaproteobacterial and mitochondrial proteins have undergone.A consilient view emerges from the combination of modelling and reducing compositional heterogeneity: the Alphaproteobacteria-sister hypothesis is robust and unlikely to be artefactual.However, we caution that the phylogenetic signal preserved in mitochondrion-encoded proteins is weak and ambiguous.The recovery of the Rickettsiales-sister relationship in previous studies 11,12 may thus be result of ambiguous phylogenetic signal and long-branch attraction.Therefore, we suggest that it is currently best to view mitochondria as an early offshoot of the alphaproteobacterial lineage that diverged just prior to the diversification of known extant groups.This is suggested by the short internal branch lengths between mitochondria and Alphaproteobacteria (see Fig. 2A, Fig. 2B) and is supported by the shared presence of the Mitochondrial Contact Site and Cristae Organizing System (i.e., a Mitofilin domain-containing Mic60) in only mitochondria and the Alphaproteobacteria, but not in members of the Magnetococcia and MarineProteo1 48,49 (Fig. S2, Table S4).Future efforts should focus on exploring diverse environments for unknown and extant alphaproteobacterial lineages that may be more closely related to mitochondria.

Metagenomic sequencing and MAG assembly
Samples collected from (1) microbial mats in the Salada de Chiprana (Spain, December 2013), Salar de Llamara 50 , Lakes Bezymyannoe and Reid (Antarctica, January 2017) and several hot springs around Lake Baikal (Southern Siberia, July 2017), (2) microbialites in Lake Alchichica 51 , and (3) sediments in Lake Baikal, were fixed in ethanol (>70%) in situ and stored at -20°C as previously described 50 .Total DNA was purified from samples using the DNeasy PowerBiofilm Kit (QIAGEN, Germany) by following the manufacturer's guidelines.DNA extracted from microbialite fragments was further cleaned using the DNeasy PowerClean Cleanup Kit (QIAGEN, Germany) as previously described 52 .DNA was quantified using Qubit®.DNA library preparation and sequencing were performed with an Illumina HiSeq2000 v3 (2x100 bp paired-end reads) by Beckman Coulter Genomics (Danvers, MA, USA), and with an Illumina HiSeq2500 (2x125 bp paired-end reads) by Eurofins Genomics (Ebersberg, Germany).A summary of the metagenomic libraries sequenced can be found in Table S2.
Raw Illumina short reads from all sequenced Illumina paired-end libraries were quality-assessed with FastQC v.0.11.7 and quality-filtered with Trimmomatic v.0.36 53 .Libraries made from samples from Lake Alchichica and the Llamara saltern were processed with the following workflow.Libraries were individually assembled, and technical replicates co-assembled (Table S2), with metaSPAdes v.3.10.0 54 .Contigs smaller than 2,500 bp in the (co-)assemblies were removed.Filtered reads were then individually mapped onto each assembly with Bowtie2 to obtain contig coverages 55 .Contigs were binned using MaxBin v.2.2.2 which relies on differential coverage across samples, tetranucleotide composition and single-copy marker genes 56 .The completeness and contamination of the bins reported by MaxBin v.2.2.2 were assessed with CheckM v.1.0.12 57 .Genome bins that were phylogenetically affiliated to the Alphaproteobacteria based on the manual examination of the CheckM reference genome tree (itself based on the concatenation of 43 marker genes) were retained.Reads were then individually mapped onto each alphaproteobacterial genome bin with Bowtie2.All paired and unpaired reads that successfully mapped to the alphaproteobacterial bins were subsequently co-assembled with metaSPAdes.The resulting co-assembly was processed through the Anvi'o metagenomic workflow 58 .In brief, reads were mapped to the final metaSPAdes co-assembly with Bowtie2 to obtain contig coverage values.DIAMOND searches 59 of predicted proteins against the NCBI GenBank nr database were done to assign taxonomic affiliations to each contig.CONCOCT2 60 , implemented in the Anvi'o suite, was used to bin the resulting metagenome.
Contigs were organized according to the composition and coverage by anvi-interactive.The predicted CONCOCT2 bins were visualized and manually refined based on their composition, coverage, taxonomy and completeness/redundancy. Libraries made from samples from Antarctica, the Chiprana saltern and Lake Baikal were processed with the following workflow.Libraries from the same location or environment type were co-assembled with MEGAHIT v.1.1.1 61 .Contigs smaller than 2,500 bp in the co-assemblies were removed.Filtered reads were then individually mapped onto each co-assembly with Bowtie2 to obtain contig coverages.Contigs were binned using three different binners (MetaBAT v.2.12.1 62 , MaxBin 2.2.4 56 , CONCOCT2 60 ) and their results were combined into consensus contigs bins with DAS Tool v.1.1.0 63.

Marker protein selection
We built an expanded dataset of mitochondrion-and nucleus-encoded proteins of alphaproteobacterial origin in eukaryotes.For the nucleus-encoded proteins, BLAST 64 similarity searches of all proteins contained in the predicted proteomes of 13 representative eukaryotes were conducted against a database of 176 prokaryotes (136 bacteria and 40 archaea).BLAST hits were clustered into homologous families with a custom Perl script, aligned with MAFFT and the L-INS-I method 65 , and then trimmed with BMGE 66 .Phylogenetic trees for each homologous gene family were inferred under the LG model in RAxML v.8 67 .These trees were then sorted based on the criterion that eukaryotes form a clade with alphaproteobacteria.Manual inspection of the trees then followed to remove paralogs and contaminants.
For mitochondrion-encoded genes, mitochondrial clusters of orthologous genes (MitoCOGs) 68 that are widespread among eukaryotes were used.
Both mitochondrion-, and nucleus-encoded candidate marker proteins were then compared through BLAST searches against those reported previously by Wang and Wu (2015) 20 and Martijn et al., (2018) 11 .
Our dataset encompassed most proteins from these other datasets, with few exceptions.The nonredundant and remaining candidate marker proteins comprising the union of these five datasets, were then further screened phylogenetically.Using a representative eukaryotic (mitochondrial) query for each marker gene, BLAST searches were done against a database that comprises 107 diverse bacteria (representing 27 cultured phyla) and 23 diverse eukaryotes (representing 6 major groups); eukaryotes were selected based on the availability of both mitochondrial and nuclear genomes or transcriptomes (see Table S7).Homologues were aligned with MAFFT, alignments trimmed with TrimAl 69 and singleprotein trees inferred with IQ-TREE 70 .The single-protein trees were inspected visually to remove duplicates, paralogues, and any other visual outlier such as extremely divergent sequences.Singleprotein trees were then re-inferred from the curated alignments and visually inspected.Proteins for which trees showed a sister relationship between eukaryotes and alphaproteobacteria were kept for further analyses.Finally, these candidate marker proteins were annotated and further refined using the EggNOG database and BLASTp searches.The final marker proteins set comprised 108 genes, 64 of which are exclusively nucleus-encoded, 17 are exclusively mitochondrion-encoded, and 27 are both mitochondrionand nucleus-encoded (Fig. S1).The annotations confirm that all marker proteins are predicted to be localized to mitochondria in eukaryotes (Table S1).

Dataset assembly
To increase taxon sampling as much as possible, MAGs reported in Anantharaman et al., (2016)   39 were added to those reconstructed here (see Metagenomic analyses).To improve the quality of our MAG selection, MAGs were analyzed with the CheckM lineage workflow and those with quality values (completeness -5x contamination) lower than 50 were discarded, just as done before by Parks et al., (2017, 2018)  39,40 .MAGs were then filtered according to their taxonomic affiliation to the Alphaproteobacteria.A phylogenetic tree for all MAGs and all Proteobacteria taxa in the GTDB r89 database 40 was inferred from 120 marker proteins, built-in in the GTDB-Tk software, using IQ-TREE v.1.6.10 70 and the LG4X+F model.To increase phylogenetic accuracy, a second tree was inferred with the LG+PMSF(C60)+G4+F using the LG4X tree as guide.All MAGs that fell within the Alphaproteobacteria clade in the GTDB-Tk tree were chosen for subsequent analyses.Together, these added up to more than 3,300 alphaproteobacteria.In order to reduce computational burden, Treemmer v.0.1b was then used to reduce the number of alphaproteobacterial taxa from the GTDB-TK tree while maximizing phylogenetic diversity 41 .The Treemmer analysis was constrained so representatives from major clades, as visually identified, were retained.Finally, a set of reference alphaproteobacteria (formally described species) were added, and long-branching alphaproteobacteria were replaced by short-branching relatives.
To retrieve homologues, PSI-BLAST searches with either one, two, or three iterations using representative mitochondrial (eukaryotic) query sequences for each marker protein were done against a database that comprised all carefully selected predicted proteomes.To remove non-orthologous sequences, homologous protein sets were retrieved for each marker protein, aligned with MAFFT, trimmed with TrimAl and trees inferred with IQ-TREE.The single-protein trees were visually inspected to remove duplicates, paralogues, and any other visual outlier such as extremely divergent sequences.The curated homologous protein sets were finally aligned again with MAFFT v.7.3.10 and the L-INS-I method.
To increase phylogenetic signal by removing poorly aligned and non-homologous aligned regions, Divvier v.1.0was used with the -partial and -mincol options 71 .Only sites with more than 10% of data were retained.To reduce incongruency among proteins due to, for example, lateral gene transfer, Phylo-MCOA v.1.4 72was employed on single-protein trees with UFBoot2+NNI as branch support which were inferred with IQ-TREE v.1.6.10 and the best-fitting model as identified by Model-Finder 70,73 .Single-protein alignments were concatenated with SequenceMatrix v.1.8 74.

Phylogenetic analyses using site-heterogeneous models
For multi-protein phylogenetic analyses on the supermatrix, trees were first inferred in IQ-TREE v.1.6.10 under the LG4X+F model.The resulting site-homogenous tree was then used as a guide tree to infer a new phylogenetic tree under the LG+PMSF(C60)+F+G4 model 75 .Consequently, the resulting siteheterogenous tree was used as a guide tree to infer a new phylogenetic tree under the dataset-specific LG+PMSF(MAM60)+F+G4 model.The dataset-specific MAM60 model was estimated using the MAMMaL software 43 .This site-heterogeneous mixture model is directly inferred from the dataset analyzed and therefore is more specific than the general C10-60 mixture models.To account for more than 60 (e.g., C60 or MAM60) amino-acid composition profiles across the data, we used the general UDM128 mixture model as LG+UDM128+G4+F that allows for 128 amino acid composition profiles 76 .The software FunDi was used to estimate functionally divergent sites in the branch that separates the mitochondrial lineage from all other taxa 42 .Sites with a probability of being functionally divergent > 0.5 were removed.Progressive removal of compositionally heterogeneous sites was performed according to the ɀ and the χ 2 metrics/methods as described before 11,29,44 .Both metrics are designed to estimate compositional heterogeneity per site based on different criteria.
PhyloBayes MCMC chains were run for >20,000 cycles or until convergence between the chains was achieved and the largest discrepancy in posterior probabilities for splits between chains ('max-diff') was <0.1.Individual chains were summarized into a Bayesian consensus tree using a burn-in of 500 trees and subsampling every 10 trees.However, most chains did not reach convergence or resolve the phylogenetic placement of mitochondria relative to alphaproteobacterial lineages (Mendeley Data).

Phylogenetic analyses using the site-and-branch-heterogeneous GFmix model
The site profile mixture models discussed above have C site frequency profiles and a K-class discretized gamma mixture model for site rates.Under these models, the likelihood of site pattern   at site i is given by: (  ;   , ) = ∑ where   is the frequency of amino acid j.For every branch e in the phylogenetic tree under consideration, we can obtain estimates by a hierarchical procedure where be is obtained from the GARP/FIMNKY ratio of all of the sequences at the tips of the tree that descend from branch e.Using these estimates, the values in the class frequency vectors,  (𝑐) , for any site profile class are modified in the following way to be branch-e-specific class frequencies,  (𝑐𝑒) .
The modified class frequencies have to satisfy a number of constraints including: = 1 and alternative hypothesis was selected based on the data rather than being fixed a priori, this can induce a selection bias 81 .The Bonferroni approach uses a input set of trees and approximates the p-value when   : τ = τ ̂ by the Bonferroni-corrected p-value one would obtain testing  0 : τ = τ 0 against   : τ = τ  ,  ∈  where  is the set of input trees that are compatible with the consensus tree of τ 0 and τ ̂.
The approximation is based on probability calculations treating the consensus tree of τ ̂ and τ 0 as the true tree.This is consistent with what is done in the chi-square test and in testing more generally, where one often calculates p-values under parameters on the boundary between the null and alternative hypotheses spaces (see 80 for additional discussion).If the true tree is the consensus tree, then it is likely that the ML topology will be in .Because the largest likelihood is the one corresponding to τ ̂, the smallest p-value among the () p-values obtained by testing  0 : τ = τ 0 against   : τ = τ  is likely to be (τ ̂); there is some possibility that a tree with a smaller degrees of freedom would give the smallest p-value, so this is an approximation.In summary, (τ ̂) is approximately the same as the minimum p-value obtained by testing  0 : τ = τ 0 against   : τ = τ  .
Rephrasing the test as approximately the same as the result of multiple tests  0 :  =  0 against   :  =   ,  ∈  lays bare that multiple testing is the source of selection bias.Bonferroni correction is a widely used approach to adjusting for multiple testing.As one final approximation, rather than using the usual Bonferroni-corrected p-value, n(A) p(τ ̂), we use the exact correction had the p-values coming from the tests been independent, This p-value is approximately the same as the usual Bonferroni correction when () () is small, which is the case of greatest interest, but has the advantage of always being between 0 and 1.Additional information about the Bonferroni correction is available in 82 .

Profile Hidden Markov Model (pHMM) searches
To search for bacteriochlorophyll enzymes, a set of 17 custom-made pHMMs for the genes bchB, bchC, bchD, bchE, bchF, bchG, bchH, bchI, bchJ, bchL, bchM, bchN, bchO, bchP, bchX, bchY, bchZ was used against predicted proteomes from the MAGs reconstructed in this study.These pHMMs were created from manually curated sets of bch genes from diverse proteobacteria.The searches were done with the program hmmsearch of the HMMER suite using an E-value cut-off of 1E-25.To search for mitofilindomain containing mic60 genes, the Pfam pHMM for Mitofilin (PF09731) was used with its own GA cut-off value.

Figure 1 .
Figure 1.An expanded gene set and novel alphaproteobacterial MAGs from diverse environments.(A) Principal Component Analysis (PCA) of amino acid compositions for each one of the 108 mitochondrial genes of alphaproteobacterial origin used in this study.Mitochondrion-encoded genes (light red); Mitochondrion-and nucleus-encoded genes (light blue); nucleus-encoded genes (green); 95% confidence ellipses follow the same color code as genes.This PCA was inferred from alignments that contain only eukaryotes.(B) Functional classification of the marker genes of alphaproteobacterial origin in eukaryotes used for multi-gene phylogenetic analyses in this study.All these functions take place inside mitochondria.A: Complex I subunit/assembly factor; B: Complex II subunit/assembly factor; C: Complex III subunit/assembly factor; D: Complex IV subunit/assembly factor; E: Complex V subunit/assembly factor; F: Cytochrome c biogenesis; G: D-lactate dehydrogenase (respiratory chain); H: Pyruvate

Figure 2 .
Figure 2. Phylogenetic tree of the Alphaproteobacteria and mitochondria, and support from our new site-and-branch-heterogeneous model.(A) Phylogenetic tree for the Alphaproteobacteria and mitochondria derived from a site-heterogeneous analyses of an untreated dataset.(B) Phylogenetic tree for the Alphaproteobacteria and mitochondria derived from a site-heterogeneous analysis of a dataset from which 50% of the most compositionally heterogeneous sites according to the ɀ metric had been removed.The removal of this amount of ɀ sites minimizes the variation of G A R P/F I M N K Y amino acid ratios across taxa (TableS5).The taxonomic labels follow the higher-level taxonomy outlined in29 .
(  |   , ()  ; )  ⁄  =1Where   is the site rate of gamma-rates class k, ()  is the vector of amino acid frequencies in class c of the site-profile mixture model,   is the class weight and  is the vector of other adjustable parameters (branch lengths,  shape parameter and tree topology) in the model.In order to model shifts in the relative frequencies of the amino acids G A R P (specified by G+C-rich codons) and F I M N K Y (specified by A+T-rich codons) in different branches of the tree, the foregoing vectors of amino acid frequencies, ()  , are modified in a branch-specific manner in the following way.Let b denote the ratio of aggregate frequencies of G A R P to F I M N K Y amino acids; i.e.,  ∶=     ⁄ for   = ∑    ∈{,,,} and   = ∑    ∈{,,,,,} , , , }  ()   ()   ()  ∈ {, , , , , }  ()   () ℎ and ∑   ()