Positive Correlation between Bacterial GC Content and Growth Temperature

12 Because GC pairs are more stable than AT pairs, GC-rich genomes were proposed to be more adapted 13 to high temperatures than AT-rich genomes. Previous studies consistently showed positive correlations 14 between growth temperature and the GC contents of structural RNA genes. However, for the whole 15 genome sequences and the silent sites of the codons in protein-coding genes, the relationship between 16 GC content and growth temperature is in a long-lasting debate. With a dataset much larger than 17 previous studies (681 bacteria and 155 archaea), our phylogenetic comparative analyses showed 18 positive correlations between optimal growth temperature and GC content both in bacterial and 19 archaeal structural RNA genes and in bacterial whole genome sequences, chromosomal sequences, 20 plasmid sequences, core genes, and accessory genes. However, in the 155 archaea, we did not observe 21 a significant positive correlation of optimal growth temperature with whole-genome GC content or GC 22 content at four-fold degenerate sites. We randomly drew 155 samples from the 681 bacteria for 1000 23 rounds. In most cases (> 95%), the positive correlations between optimal growth temperature and 24 genomic GC contents became statistically nonsignificant ( P > 0.05). This result suggested that the 25 small sample sizes might account for the lack of positive correlations between growth temperature and 26 genomic GC content in the 155 archaea and the bacterial samples of previous studies.

Introduction 32 archaeal genomes are the best materials to test the thermal adaptation hypothesis. An analysis of 764 48 prokaryotic species, including mesophilic genera and thermophilic genera, did not find a correlation 49 between whole-genome GC content (GCw) and the optimal growth temperature (Topt) (Galtier and 50 Lobry 1997). However, this study found a significant positive correlation between Topt and the GC 51 content of structural RNA (tRNAs and rRNAs). The rationale of these observations is that the 52 secondary structures of tRNAs and rRNAs are more sensitive to high temperature than the double-53 strand helix of DNA. In most prokaryotes, protein-coding genes take most of the genome size. Protein 54 structures and functions constrain the GC content evolution at the nonsynonymous sites of the codons. 55 This functional constraint might conceal the hypothetical thermal adaptation. Compared with GCw, the 56 GC content at the third sites of the codons (GC3) is more desirable to test the thermal adaptation 57 hypothesis. Early solitary cases indicated that GC3 might be related to growth temperature. For 58 example, the tyrosy1-tRNA synthetase gene isolated from the thermophile Bacillus stearothermophilus 59 has a higher GC3 than the homologous gene in Escherichia coli, 68.0% vs. 59. 4% (Winter et al. 1983). 60 The leuB gene isolated from the extreme thermophile Thermus thermophilus HB8 has an extremely 61 high GC3, 89.4% (Kagawa et al. 1984). For a general conclusion, Hurst and Merchant (2001) 62 examined the relationship between GC3 and Topt of 29 archaeal species and 72 bacterial species. 63 Unfortunately, they did not find significant correlations between Topt and GC3 or between Topt and 64 GCw. At the same time, they also found a significant positive correlation between the GC content of 65 structural RNAs and the Topt in both archaea and bacteria. As their analysis had accounted for the 66 effect of shared ancestry, they provided more robust evidence against the thermal adaptation 67 hypothesis. Soon afterward, Xia et al. (2002) showed that the growth at increasing temperature (from 68 37C to 45C) for 14,400 generations did not increase but decreased the genomic GC content of the 69 bacterium Pasteurella multocida. Furthermore, Lambros et al. (2003) reported a negative correlation 70 between optimal growth temperature and the GC content of protein-coding genes in 550 prokaryotes, 71 despite that the effect of the shared ancestry had not been controlled. 72 Subsequently, Musto et al. (2004) published a debate-provoking study. As many environmental 73 factors likely influence genomic GC content evolution, closely related species are expected to differ in 74 . CC-BY 4.0 International license made available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprint this version posted May 6, 2021. ;https://doi.org/10.1101https://doi.org/10. /2021 3 fewer environmental factors than distantly related species. The correlation of GC content with growth 75 temperature is less likely disturbed by other factors when the analysis is limited within closely related 76 species. Therefore, Musto et al. (2004) examined the relationship between genomic GC content and 77 Topt with each prokaryotic family. Among the 20 families they studied, the number of families with 78 positive correlations is significantly higher than expected by chance, no matter the effect of the 79 common ancestors was accounted for or not. Meanwhile, they observed a significant positive 80 correlation when considering all independent contrasts from different families together. However, 81 Marashi and Ghalanbor (2004) noticed that most of the significant correlations within each family 82 depend heavily on the presence of a few outlier species. Exclusion of only one species would lead to 83 loss of significant correlations in several families. Basak et al. (2005) pointed out that the correlation is 84 sensitive to the presence or absence of a few outliers in some families because the sample sizes in 85 these families were too small. Using non-parametric correlation analysis that is not sensitive to the 86 presence of outliers, Musto et al. (2005) repeated their analysis and confirmed their previous results. 87 The debate did not end after that. Wang et al. (2006) updated the Topt values for some species and 88 found that the positive correlation between Topt and genomic GC content in two families disappeared. 89 Besides, they suggested that the positive correlation between Topt and genomic GC content in the 90 family Enterobacteriaceae should be explained by the correlation between genome size and optimal 91 temperature. Still, this study did not shake the confidence of Musto et al. (2006) Merchant (2001) and Xia et al. (2002), but without mentioning the debates on Musto et al. (2004). 97 As prokaryotic genomes often have many accessory genes that are frequently lost and gained, the 98 genome-wide measures of GC content could roughly reflect the shaping effects of environmental 99 factors in evolution. By contrast, the structural RNA genes ubiquitously exist in prokaryotic genomes, 100 and their GC contents are more comparable in large-scale phylogenetic analyses. Similarly, the core 101 genome or strictly defined orthologous genes could also accurately reflect the historical shaping effect 102 of growth temperature on GC content evolution. Ream et al. (2003) analyzed the GC contents of two 103 genes (ldh-a and -actin) across 51 vertebrate species with adaptation temperatures ranging from 104 −1.86C to approximately 45C. They did not find any significant positive correlations between living 105 temperature and GC content, whether the GC content is measured by the entire sequences, the third 106 codon position, or the fourfold degenerate sites. However, Zheng and Wu (2010) found a positive 107 correlation between growth temperature and the GC content in the coding regions of four genes across 108 815 prokaryotic species, including mesophiles, thermophiles, and hyperthermophiles. These four 109 genes shared by all the 815 prokaryotic genomes could be considered strictly defined core genomes. 110 . CC-BY 4.0 International license made available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprint this version posted May 6, 2021. ; https://doi.org/10.1101/2021.04.27.441598 doi: bioRxiv preprint 4 Using a manually collected dataset of growth temperature and without accounting for the effect 111 of the common ancestors, Sato et al. (2020) recently confirmed the results of Galtier and Lobry 112 (1997). It should be noted that the correlation between Topt and the GC content of structural RNA was 113 consistently observed in much more studies than those mentioned above (Khachane et al. 2005;114 Kimura et al. 2006;Kimura et al. 2007;Kimura et al. 2013;Sato et al. 2020). By contrast, as reviewed 115 above, the correlation between Topt and genomic GC content, if it exists, depends heavily on the 116 sample size, the families of prokaryotes, the particular sequences, and the methods used to detect it. 117 Benefit from the manually curated dataset of growth temperature from the database TEMPURA 118 (Sato et al. 2020), we carried out a comprehensive analysis on the relationship between growth 119 temperature and GC content. The present study covers three indexes of growth temperature (maximal 120 growth temperature [Tmax], Topt, and minimal growth temperature [Tmin]) and a series of GC 121 content indexes, including GC content of the whole genome (GCw), GC content of the protein-coding 122 sequences (GCp), GC content at fourfold degenerate sites (GC4), GC content of the genes coding 123 structural RNAs (tRNA, GCtRNA; 5S rRNA, GC5S; 16S rRNA, GC16S; 23S rRNA, GC23S) and GC 124 content of non-coding DNA (GCnon, including intergenic sequences and untranslated regions of mRNA 125 that are generally unannotated in prokaryotic genomes). The whole genome, primary chromosome 126 genome sequences, plasmid genomes, core genes, and accessory genes have been examined separately. 127 Our results consistently support a positive correlation between genomic GC content and growth 128 temperature in bacteria. 129

Results 131
Strong phylogenetic signals in both GC contents and growth temperatures 132 A significant force shaping prokaryotic evolution is horizontal gene transfer, which makes the 133 genealogical relationships among bacteria and archaea exhibit a somewhat network-like structure. If 134 bifurcation is not the phylogeny's dominant pattern, most phylogenetic comparative methods will not 135 be necessary for prokaryotic evolutionary studies. We are not sure how much this impression has 136 influenced the researchers in prokaryotic genomic studies, but many papers did not use any CC-BY 4.0 International license made available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprint this version posted May 6, 2021. ; https://doi.org/10.1101/2021.04.27.441598 doi: bioRxiv preprint

Bacterial but not archaeal genomic GC contents correlated with growth temperatures 148
We used the phylogenetic generalized least squares (PGLS) regression to examine the relationships 149 between GC contents and growth temperatures. The significant positive and negative slopes of the 150 regressions correspond to significant positive and negative correlations, respectively. Four 151 phylogenetic models, the Brownian motion model (BM), the Ornstein-Uhlenbeck model with an 152 ancestral state to be estimated at the root (OUfixedRoot), the Pagel's lambda model (lambda), and the 153 early burst model (EB), have been applied in the analysis. Their results are qualitatively identical and 154 quantitatively similar. Therefore, we present the BM model results in the main text and deposit other 155 models' results as supplementary tables. 156 Consistent with numerous previous studies, we found positive correlations between the GC 157 contents of structural RNA genes and growth temperature in bacteria and archaea (Table 2). We 158 noticed a rank in the slope values, from Tmax, Topt, to Tmin. 159 Interestingly, we also found positive correlations of Tmax and Topt with various indexes of 160 genomic GC contents, GCw, GCp, GC4, and GCnon, in bacteria (Table 2). Nevertheless, bacterial Tmin 161 is not correlated with three of the four GC content indexes (Table 2). In archaea, none of the three 162 temperature indexes (Tmax, Topt, or Tmin) have any significant correlations with any of the four 163 genomic GC content indexes (Table 2). 164 If growth temperature could shape GC contents by the stabilities of RNA secondary structures 165 and DNA double helix, a structural RNA or a DNA double helix that is stable at the Tmax or Topt is, 166 of course, stable at the Tmin. In this logic, it is reasonable to see that the Tmin has weaker or no 167 significant correlations with GC contents. 168 The difference in the correlations between bacteria and archaea might be attributed to either 169 unknown intrinsic differences between these two domains or the significant difference in the sample 170 size, 681 vs. 155. 171 172

Sample sizes matter 173
If the lack of significant correlations between genomic DNA and Tmax and Topt in archaea results 174 from the small sample size, the correlations in bacteria will be lost when the sample size of bacteria is 175 reduced to 155. For this reason, we randomly selected 155 bacteria from the 681 bacterial samples for 176 1000 rounds. The results of resampling analysis confirmed the idea, the sample sizes matter (Table 3). 177 In > 950 rounds, the genomic GC content indexes (GCw, GCp, GC4, and GCnon) are not correlated with 178 Tmax or Topt (P > 0.05). This result could also explain the difference between the present study with 179 Hurst and Merchant (2001), which did not find significant correlations between GCw/GC3 and Topt by 180 phylogenetic analysis of about 100 prokaryotes. Meanwhile, a few positive correlation cases happen, 181 indicating that significant positive correlations could also be found by chance when the analyzed 182 sample is small. 183 Besides, the correlations between growth temperature and the GC contents of structural RNA 184 . CC-BY 4.0 International license made available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprint this version posted May 6, 2021. ; https://doi.org/10.1101/2021.04.27.441598 doi: bioRxiv preprint genes might also be lost occasionally when the sample size is severely reduced (Table 3). In the 1000 185 rounds of resampling, lacking significant correlations happens in 308 (for Tmax) and 473 (for Topt) 186 rounds for 5S rRNA genes, and 12 (for Tmax) and 21 (for Topt) rounds for tRNA genes. However, in 187 the 16S and 23S rRNA genes, positive correlations were consistently observed in all the 1000 rounds 188 of resampling. We suspected that the tens of times more nucleotides in 16S and 23S rRNA than 5S 189 rRNA make the results of 16S and 23S rRNAs less sensitive to small sample sizes. 190 In statistics, the rule of thumb boundary between small and large samples is n = 30. However, the 191 results in Table 3 indicate that n = 155 is a too-small sample in the phylogenetic comparative analyses 192 of the relationship between growth temperature and genomic GC content. Because of the common 193 ancestor, two closely related lineages with highly similar growth temperatures and GC contents should 194 be regarded as nearly one effective sample rather than two independent samples. The effective sample 195 size in phylogenetic comparative studies should be much lower than the census number of the 196 analyzed lineages. 197 198 Qualitative data on growth temperature lead to the same conclusion 199 In the ProTraits database (http://protraits.irb.hr/) and the IMG database (https://img.jgi.doe.gov/) (Brbić 200 et al. 2016;Chen et al. 2020), many prokaryotes lack quantitative measures of growth temperature but 201 are qualitatively classified into four categories: psychrophiles/psychrotrophiles, mesophiles, 202 thermophiles, and hyperthermophiles. We constructed a qualitative dataset of prokaryote growth 203 temperature, including data downloaded from these two datasets and the prokaryotes in the 204 TEMPURA database classified into the four categories referring (Sato et al. 2020). By assigning 1,2,205 3, and 4 to the psychrophiles/psychrotrophiles, mesophiles, thermophiles, and hyperthermophiles, 206 respectively, we transformed the qualitatively characters into numerical values. PGLS regression 207 analyses revealed a positive correlation between GC content and growth temperature in bacteria (slope 208 = 0.457, P = 0.001), but not in archaea (slope = −0.582, P = 0.170). Although this dataset (4696 209 bacteria and 279 archaea) is much larger than analyzed above (681 bacteria and 155 archaea), it lost 210 much information during the qualitative classification. All the differences in growth temperature 211 within each category disappear. 212 We also examined whether the contrast in the temperature category is correlated with the contrast 213 in the GC content between terminal tips of the phylogenetic tree by referring ( (Table 4). Interestingly, the same pattern was also found in the bacterial plasmids (

Positive correlations observed in both core genes and accessory genes 238
To correspond to the previous gene-centered studies (Zheng and Wu 2010), we examined the 239 correlations in bacterial core genes, i.e., genes present in all the bacteria. With the increase in the 240 number of bacteria, the number of core genes decreases rapidly. With a trade-off between the number 241 of core genes and the number of bacteria, we selected 28 core genes present in 420 genomes, mostly 242 ribosomal protein genes. Significant positive correlations have been found between GC contents (GCp 243 and GC4) and growth temperatures, Tmax, and Topt (Table 5). 244 At the opposite side of the core genes, the accessory genes present in one or a few bacteria. When 245 we define the accessory genes as the genes present in less than 5% of the analyzed bacterial genomes, 246 on average, each bacterium has 152 accessory genes. Positive correlations were observed between GC 247 contents (GCp and GC4) and growth temperatures (Tmax and Topt), although the values of 248 significance are slightly larger than those in core genes (Table 5). Similar patterns were observed when 249 we increased the threshold in defining accessory genes to 10% (P < 0.05 for all cases). 250 By the way, we compared the GC content between bacterial core genes and accessory genes using 251 a phylogenetic paired t-test (Lindenfors et al. 2010). Unlike the previous analysis of 36 prokaryotes 252 that did not account for the effect of common ancestors (Bohlin et al. 2017), we did not observe 253 significant differences in GC content between the core genes and the accessory genes (Table S16). We 254 also compared the chromosomal accessory genes and plasmid accessory genes. The accessory genes 255 on chromosomes have significantly higher GC contents than those on plasmids (Table S17). 256 257 Discussion 258 . CC-BY 4.0 International license made available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprint this version posted May 6, 2021. ; https://doi.org/10.1101/2021.04.27.441598 doi: bioRxiv preprint 8 The GC pairs are thermally more stable than AT pairs in both DNA double helix and structural RNAs. 259 However, this difference is not necessarily a strong enough force to shape the evolution of GC content. 260 As RNA structures are more sensitive to the elevation of temperature than DNA double helix, the 261 growth temperature is stronger in shaping the GC content evolution of the structural RNA genes than 262 in shaping the genomic GC content evolution. Positive correlations between growth temperature and 263 the GC content of structural RNA genes have been repeatedly observed in various prokaryotic studies 264  Overlooking the effect of common ancestors would severely affect the accuracy of the results 278 (Felsenstein 1985). 279 Our resampling analysis indicates that the lack of significant correlations in archaea might result 280 from the small number of effective samples. We hope to repeat the present study in the future with a 281 larger sample of archaeal genomes. However, it should also be kept in mind that the possibility of no 282 correlation between GC content and growth temperature has not been convincingly excluded. Some 283 intrinsic differences between bacteria and archaea might produce a sharp difference in the correlation 284 between GC content and growth temperature. Most prokaryotes have negatively supercoiled DNA,285 whereas the prokaryotes that grow at temperatures higher than 80C (mostly archaea) generally have 286 their genomic DNA positively supercoiled with a particular enzyme, reverse gyrase (Vettone et al. (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprint this version posted May 6, 2021. ; https://doi.org/10.1101/2021.04.27.441598 doi: bioRxiv preprint 9 nucleotide composition evolved independently in bacterial adaptation to high temperature. 296 As the frequent gain and loss of plasmids, the plasmid DNAs could be regarded as accessory 297 genomes. Because of the high turnover rates of plasmids and accessory genes in prokaryotic evolution, 298 we could regard them as new immigrants, as opposed to the natives for the chromosomes and core 299 genes. Although the core genes and even the ribosomal RNA genes may occasionally be transferred The new immigrants compatible with the host have a GC content adapted to the host's growth 316 temperature. 317 A previous serial transfer experiment seems to be contradictory to our results. Increased genomic 318 GC content was not observed in the bacterium P. multocida after 14,400 generations of increasing 319 temperature from 37C to 45C (Xia et al. 2002). Although we observed a positive correlation between 320 genomic GC content and growth temperature, we do not think a small increment in GC content, 321 resulting from either a GC-biased mutator or integration of a GC-rich exogenous sequence, would 322 bring a great advantage to the host organism. Most likely, it is just a slight advantage. According to the 323 population genetic theory, the slightly beneficial mutants will be efficiently selected only when they 324 are in a large population. The experimental evolution generally involves severe, periodic reductions in 325 population size, and the bottleneck effect dramatically reduces the fixation probability of beneficial 326 mutations (Wahl et al. 2002). As we see, large-scale statistical analysis has the advantage of revealing 327 slightly beneficial traits. 328 Musto et al. (2004) emphasize that only when closely related species are compared, the growth 329 temperature is likely to be the only influencing factor in GC content evolution. Our pairwise 330 comparison of neighboring branches with different ranks of growth temperature ( fig. 1) gave the same 331 conclusion as our PGLS analyses. We agree that many factors would influence GC content evolution, 332 . CC-BY 4.0 International license made available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprint this version posted May 6, 2021. ; https://doi.org/10. 1101/2021 and the positive relationship between growth temperature and GC content is a statistically significant 333 result. In the 273 pairs of bacteria, there are 153 pairs where high growth temperature ranks have 334 higher GC contents and 119 pairs with the opposite pattern. 335 Finally, we should remark that what we observed are just correlations between GC content and 336 growth temperature, which imply rather than prove the causal effects between the two variables. The GC contents of these prokaryotes were calculated from their genome sequences. 352 We also constructed a large dataset according to their growth temperature qualitatively. First, we 353 divided the 836 prokaryotes mentioned above into four categories according to their growth 354 temperature referring to (Sato et al. 2020 (Table S18). The whole-genome GC contents of these prokaryotes were downloaded directly from the 362 NCBI genome database (https://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/prokaryotes.txt). with different ranks of growth temperature (psychrophiles/psychrotrophiles, mesophiles, thermophiles, 367 and hyperthermophiles). In cases where two or more neighboring tips with the same rank were used to 368 pair with bacteria with another rank, we used the average value of their GC contents to represent the 369 . CC-BY 4.0 International license made available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprint this version posted May 6, 2021. ;https://doi.org/10.1101https://doi.org/10. /2021 11 GC content of their internal node. 370 The phylogenetic signals () of both GC contents and growth temperatures were estimated using 371 the phylosig function of the R (Version 4.0.3) package phytools (Version 0.7-70) (Revell 2012 Tmax, Topt, and Tmin represent maximal, optimal, and minimal growth temperature, respectively; 549 GCw, GCp, GC4, GCtRNA, GC5S, GC16S, GC23S, and GCnon represent the GC contents of the whole 550 genome, the protein-coding sequences, the fourfold degenerate sites, the genes coding tRNAs, the 551 genes coding 5S rRNA, the genes coding 16S rRNA, the genes coding 23S rRNA, and the non-coding 552 DNA (including intergenic sequences and untranslated regions of mRNA), respectively. The 553 phylogenetic signals of the chromosomal genes, the plasmid genes, the core genes, and the accessory 554 genes are also very close to one and deposited in supplementary Table S1-S4. 555 556 557 . CC-BY 4.0 International license made available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is  The results in this table were obtained using the Brownian motion model. Similar results obtained 574 from three other models are deposited in supplementary Table S10-S12. PBH, Benjamini-Hochberg 575 adjusted P value. Please see Table 1 for the meanings of the other abbreviations.  The results in this table were obtained using the Brownian motion model. Similar results obtained 583 from three other models are deposited in supplementary Table S13-S15. PBH, Benjamini-Hochberg 584 adjusted P value. Please see Table 1 for the meanings of the other abbreviations. 585 586 587 588 . CC-BY 4.0 International license made available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprint this version posted May 6, 2021. Pairwise comparison of the GC contents between closely related prokaryotes with different growth 593 temperature ranges. Both bacteria (A) and archaea (B) were classified into four ranks according to their 594 growth temperature, from low to high: psychrophiles/psychrotrophiles, mesophiles, thermophiles, and 595 hyperthermophiles. The diagonal line represents cases in which prokaryotes with different ranks have 596 the same GC contents. Points above the line (153 pairs of bacteria and 17 pairs of archaea) represent 597 cases in which prokaryotes with higher ranks have higher GC contents than their paired relatives, while 598 points below the line (119 pairs of bacteria and 24 pairs of archaea) indicate the reverse. All the p values 599 were calculated using two-tailed Wilcoxon signed-rank tests. The exact values of the GC contents are 600 shown in supplementary Table S9.  601  602  603  604 605 . CC-BY 4.0 International license made available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprint this version posted May 6, 2021. ; https://doi.org/10. 1101/2021