The trait coding rule in phenotype space

Genotype and phenotype are the two central themes of modern biology. Although elegant protein-coding rules were recognized in genotype decades ago, little is known about how traits are coded in a phenotype space (P). Mathematically, P can be partitioned into a subspace determined by genetic factors (P_G) and a subspace affected by non-genetic factors (P_NG). Evolutionary theory predicts that P_G is composed of a limited number of dimensions while P_NG may have infinitely many, which suggests a dimension decomposition method, termed uncorrelation-based high-dimensional dependence (UBHDD), to separate them. We applied UBHDD to a yeast phenotype space comprising ~400 traits in ~1,000 individuals. The obtained tentative P_G matches the actual genetic components of the yeast traits, explains the broad-sense heritability, and facilitates the mapping of quantitative trait loci, suggesting that the tentative P_G is the yeast genetic subspace. A limited number of latent dimensions in P_G were found to be recurrently used for coding the diverse yeast traits, while dimensions in P_NG tend to be trait-specific and increase constantly with trait sampling. A similar separation was achieved when applying UBHDD to the UK Biobank human brain phenotype space, which comprises ~700 traits in ~26,000 individuals. The obtained P_G helped elucidate the genetic versus non-genetic origins of the left-right asymmetry of the human brain, and revealed several hundred novel genetic correlations between brain regions and dozens of mental traits/diseases. In sum, by developing a dimension decomposition method we show that phenotypic traits are coded by a limited number of genetically determined common dimensions and unlimited trait-specific dimensions shaped by non-genetic factors, a rule fundamental to the emerging field of phenomics.


Introduction
The physical world is both macroscopic and microscopic, the former of which is the manifestation of the latter. Physicists adopt two rather parallel frameworks to describe the world: classical mechanics for the macroscopic layer and quantum mechanics for the microscopic layer [1]. For biologists, the macroscopic layer is phenotype and the microscopic layer is genotype. The mainstream of current biology adopts a bottom-up thinking: because genotype is the basis of phenotype, we rely on the former to understand the latter [2]. However, efforts of applying genotype

where P represents the phenotype space formed by all T, P_G represents the genetic subspace formed by all T_G, and P_NG represents the residual (or non-genetic) subspace formed by all T_NG. Specifically, P, P_G and P_NG are each a multi-dimensional linear space described by a matrix in which columns are trait vectors. Following the matrix notation there exists a set of orthogonal base vectors in P_G, which we term G-dimensions. Linear combinations of the G-dimensions can form all vectors in P_G (i.e., all T_G). Similarly, the NG-dimensions in P_NG can be defined. Importantly, the number of G- (or NG-) dimensions is larger than or equal to the rank of P_G (or P_NG). Accordingly, each trait T can be formulated as a linear function of the G-dimensions and NG-dimensions:

converges to one if the two vectors have a strong correlation for any finite N, which is formulated

Fig. 1d, if the dimensionality N is much smaller in P_G than in P_NG. As a result, the best model

traits, by conducting such uncorrelation-based separation for every trait in P we would achieve the separation of P_G from P_NG. We term the method uncorrelation-based high-dimensional dependence (UBHDD).

Validation of UBHDD using simulation
To test if UBHDD can separate subspaces of distinct dimensionality, we simulated a space P that

respectively; the resulting matrices containing all T_g or all T_ng are called P_g or P_ng, approximating P_G and P_NG, respectively.

As expected, with the increase of trait sampling the number of sampled dimensions saturates much more rapidly for P_G than for P_NG (Fig. 2a). We note that the sampled dimensions in P_NG would keep increasing if the dimensionality of P_NG were infinitely large. Two correlated traits often share both G-dimensions and NG-dimensions, while two uncorrelated traits could share G-dimensions but rarely NG-dimensions (Fig. 2b). This suggests that G-dimensions but not NG-dimensions would underlie the signal of UBHDD. Indeed, in all cases we found the T_g obtained by UBHDD highly correlated with T_G, the actual genetic component of T (Fig. 2c). The variance of T_g also matches well the variance of T_G, the broad-sense heritability of T (Fig. 2d). We also simulated spaces with N1 = 20, 50, or 100 (N2 remains unchanged), and obtained largely the same results (Fig. S1). These analyses validated the capacity of UBHDD in separating P_G from P_NG.
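The logic of this validation can be illustrated with a minimal sketch. It is not the simulation used in the study: the sizes, the heritability of 0.5, the threshold `r_u=0.15`, the ridge penalty `lam`, the train/held-out split and the names `simulate_space` and `ubhdd_predict` are all assumptions chosen for illustration, and each trait is given a single private noise dimension to mimic a very high-dimensional P_NG. A focal trait is regressed on the traits that are marginally uncorrelated with it, and the prediction is compared with the simulated T_G.

```python
import numpy as np

def simulate_space(n_ind=1000, n_traits=200, n_g_dims=10, h2=0.5, seed=0):
    """Toy phenotype space: few shared G-dimensions plus trait-specific noise."""
    rng = np.random.default_rng(seed)
    g_dims = rng.standard_normal((n_ind, n_g_dims))      # latent G-dimensions
    load = rng.standard_normal((n_g_dims, n_traits))     # random loadings
    load *= np.sqrt(h2) / np.linalg.norm(load, axis=0)   # Var(T_G) = h2 for every trait
    t_gen = g_dims @ load                                # simulated genetic components
    # One private noise dimension per trait (assumption): stands in for a huge P_NG.
    t_non = np.sqrt(1.0 - h2) * rng.standard_normal((n_ind, n_traits))
    return t_gen + t_non, t_gen

def ubhdd_predict(P, focal, r_u=0.15, n_train=700, lam=1.0):
    """Predict a focal trait from traits whose marginal |r| with it is below r_u."""
    r = np.corrcoef(P[:n_train], rowvar=False)[focal]    # correlations on training rows only
    unc = np.flatnonzero(np.abs(r) < r_u)                # focal has r = 1, so it is excluded
    X, y = P[:, unc], P[:, focal]
    mu_x, mu_y = X[:n_train].mean(axis=0), y[:n_train].mean()
    Xc, yc = X[:n_train] - mu_x, y[:n_train] - mu_y
    # Ridge regression is an assumption here; the source only states a linear model.
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(unc.size), Xc.T @ yc)
    return (X - mu_x) @ beta + mu_y, unc

P, T_G = simulate_space()
t_g, unc = ubhdd_predict(P, focal=0)
held_out = slice(700, None)
r_true = np.corrcoef(t_g[held_out], T_G[held_out, 0])[0, 1]
print(f"{unc.size} uncorrelated traits; corr(T_g, T_G) on held-out rows = {r_true:.2f}")
```

Because the noise of each trait is private in this toy setting, any held-out predictive signal must come from the shared G-dimensions, which is the intuition behind the method.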

It is worth noting that UBHDD is a method of dimension decomposition rather than dimension reduction. We compared UBHDD with PCA, a classical dimension reduction method, in a simulated P with structure. The structured P was simulated as above except that two large clusters with strongly correlated members exist (Fig. 2e; Supplementary Note III). UBHDD remains successful in separating P_G from P_NG, insensitive to the space structure (Fig. 2f). However, PCA overfits the traits in the two large clusters and underfits the others (Fig. 2g; Methods). The failure of PCA in separating P_G from P_NG is not surprising because PCA maximizes the explained variance of the top PCs and is therefore sensitive to data structure.

Using UBHDD to separate a yeast phenotype space
We examined a phenotype space comprising 405 morphological traits of the budding yeast which has two clones/replicates and known genotype [26] (Fig. 3a). The traits are typically about area, distance, angle, and brightness that describe the shape of mother cell and bud, the neck separating mother cell from bud, the localization of the nuclei in mother cell and bud, and so on, across different cell stages (Fig. 3b). The narrow-sense heritability (h^2) of the traits ranges from 0 to 0.56 with a median of 0.15, and the broad-sense heritability (H^2) ranges from 0 to 0.86 with a median of 0.42 (Fig. S2).

Since biological replicates are available for the yeast phenome, we can use a linear mixed model (LMM) to separate T_G from T_NG for each of the traits. Meanwhile, the separation could be done by UBHDD, which requires only phenome information according to the above theory and simulation results (Fig. 3c). We then use the results of LMM to benchmark UBHDD.

We applied UBHDD to the 405 yeast traits and obtained T_g and T_ng for each of them (Methods). The obtained T_g explains trait variance at a level ranging from 0.03 to 0.98, with a median of 0.53 among all traits (Fig. 3d). Hence, strong high-dimensional dependence between the uncorrelated yeast traits is observed. To assess the potential false positive/background signals, we conducted shuffling analyses by randomly swapping the focal trait values among individuals while keeping the uncorrelated traits unchanged (Methods). We found virtually no trait variance explained (maximum of 0.013 among all traits) by the T_g obtained in the shuffled dataset (Fig. 3d). Hence, technical biases in the UBHDD modeling process are negligible. Notably, the results of the shuffling analyses are consistent with our intuition in the empirical world that uncorrelated objects are independent, which carries a hidden assumption of infinite dimensionality.

The observed strong UBHDD signals suggest a special set of latent dimensions underlying the yeast traits.

To test if the UBHDD signals represent actual genetic components, we applied LMM to separate T_G from T_NG for each trait by taking advantage of the replicate information (Methods).

For most of the traits the UBHDD signal T_g is highly correlated with the actual genetic component T_G (Fig. 3e-f). The variance of T_g is comparable to the variance of T_G, the broad-sense heritability estimated by LMM (Fig. 3g). The results are robust against the Ru thresholds used for defining uncorrelated traits (Fig. S3). As another critical test, we expect that T_g should have a larger narrow-sense heritability (h^2) than T_ng. Indeed, in most cases the h^2 of T_g is larger than that of T_ng, and more QTLs were detected for T_g than for T_ng (Fig. 3h-i; Methods). Nevertheless, T_g is not identical to T_G. The T_g estimation could be improved in a larger population that enables more robust UBHDD modelling; meanwhile, the T_G estimation could be more accurate if there were more than two replicates. Taken together, these results suggest that the T_g obtained by UBHDD represents well the actual genetic components of the yeast traits.

The separations by UBHDD are robust between two yeast populations
In addition to the segregant population (seg-population), we also examined a yeast gene-deletion population (del-population) that contains ~5,000 S. cerevisiae strains each lacking a non-essential gene (Fig. 3j). The same 405 traits are measured for each of the strains in the del-population [27]. We conducted UBHDD in the del-population and obtained T_g and T_ng for each of the traits (Methods). We then compared the T_g functions learned in the del-population with the T_g functions previously learned in the seg-population (Methods). Taking the trait C11.1_A as an example, when the T_g function learned in the seg-population is applied to the del-population, the T_g estimations are highly similar to the estimations by the T_g function learned in the del-population, with an identity score of 0.88 (Fig. 3k; Methods). The identity score of the 405 traits ranges from 0.29 to 0.99 with a median of 0.82 (Fig. 3l), suggesting that the genetic subspace obtained by UBHDD is robust between the two yeast populations.

Using UBHDD to separate human brain phenotype space
To test if UBHDD works in a more complex phenotype space, we examined the UK Biobank human phenome. We focused on the 675 image-derived phenotypes (IDPs) of brain generated by dMRI in 25,957 white British individuals without kinship and with genotype available (Fig. 4a; Methods) [28]. These brain image traits represent nine different measures, including fractional anisotropy (FA), intra-cellular volume fraction (ICVF), isotropic or free water volume fraction (ISOVF), mean diffusivity (MD), diffusion tensor mode (MO), orientation dispersion index (OD) and the three eigenvalues in a diffusion tensor fit (L1, L2 and L3), in up to 75 brain regions.

We applied UBHDD to the 675 brain image traits after excluding covariates and obtained T_g and T_ng for each of them (Methods). The obtained T_g explains trait variance at a level ranging from 0.17 to 0.87, with a median of 0.48 among all traits (Fig. 4b). We conducted the same shuffling analysis as in yeast and again found virtually no trait variance (maximum of 4e-4 among all traits) explained by the T_g obtained in the shuffled dataset (Methods). The results are robust against the Ru thresholds for defining uncorrelated traits (Fig. S4). Because, unlike in yeast, there are no clones (i.e., monozygotic twins) for most individuals, we could not use LMM to estimate T_G and broad-sense heritability. Instead, we examined narrow-sense heritability. Consistent with the findings in yeast, T_g in general has a larger h^2 than T_ng; there are also more QTLs detected in T_g than in T_ng (Fig. 4c-d; Methods). Notably, for those traits with a strong enrichment of the additive variance in T_g, the number of QTLs of T_g is even larger than that of the whole trait T, suggesting novel genetic basis revealed by focusing on T_g (Fig. S5). These data suggest that the T_g obtained here is at least enriched with the genetic components of the brain image traits. The results have two immediate applications.

First, they are helpful for addressing a long-standing puzzle, namely, the relative contribution of genetic versus non-genetic factors to the left-right asymmetry of human brain [29,30]. We examined all 297 symmetrical trait pairs each representing the same measure in two symmetrical brain regions. For each trait pair we calculated the Pearson's R^2 of T_g and T_ng, respectively, among the individuals. In all trait pairs the R^2 of T_g is much larger than that of T_ng (Fig. 4e). This finding suggests that non-genetic factors are the major source of the brain asymmetry, highlighting environmental effects on asymmetry-associated brain physiology and dysfunction.

Second, because of the enrichment of genetic component, T_g should be particularly useful for identifying genetic correlations of the brain image traits with other traits including diseases.

Such genetic correlations can inform the specific brain regions associated with or responsible for diseases, which would be valuable for diagnosis and/or therapy. We calculated genetic correlations [31] between the 675 brain image traits and a curated set of traits with required summary statistics [32]. These traits include 33 common mental traits (including diseases and non-diseases), 13 respiratory/circulatory diseases that are associated with autonomic nervous system, and 32 miscellaneous diseases that do not seem to be tightly linked with brain (Methods; Table S1). A

with the brain image traits than the miscellaneous diseases. Second, T_g performed much better than T in revealing genetic correlations. The results in turn support the enrichment of T_g for genetic component.

To show more details we plotted all statistically significant genetic correlations for the mental traits and the respiratory/circulatory diseases, respectively (Fig. 5d-e). There are a few global patterns: First, brain regions vary substantially in the number and profile of correlated diseases/traits. For example, the brain region "fornix" has significant genetic correlation with only one disease Schizophrenia, while the region "superior fronto-occipital fasciculus" has significant

Distinct dimensionality of P_g and P_ng reveals a trait coding rule
Using UBHDD we estimated the genetic component T_g and non-genetic component T_ng for each of the traits examined in yeasts and humans. Combining all T_g of the yeast traits (or human brain traits) forms P_g, the estimated genetic subspace of the yeast (or human brain) phenotype space.

Similarly, combining all T_ng forms P_ng, the estimated non-genetic subspace. We then examined

the number of PC dimensions is rapidly saturated for P_g but not for P_ng (Fig. 6a-b), highlighting the distinct dimensionality between P_g and P_ng. The observed dimensionality disparity is consistent with the underlying theory of UBHDD.

To show how the dimensions of P_g and P_ng are used by the traits we calculated the gradient between the number of sampled traits (nT) and the number of obtained dimensions (nD), denoted as ΔnT/ΔnD. With the increase of dimensionality the gradient rapidly increases to be large for P_g but remains small for P_ng, suggesting the P_g dimensions are recurrently used by the traits while the P_ng dimensions tend to be trait-specific (Fig. 6c-d). Consistently, the pairwise correlation of T_g, which reflects dimension sharing between traits, is much larger than that of T_ng (Fig. 6e-h).
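The dimension counting and the gradient ΔnT/ΔnD can be sketched as follows. This is an illustration under stated assumptions rather than the analysis code of the study: the helper names `n_pc_dims` and `dimension_curve`, the use of covariance-based PCA via SVD, the trait sample sizes and the two toy matrices are all hypothetical. The 85% variance cutoff follows the one quoted in the supplementary figure legend.

```python
import numpy as np

def n_pc_dims(M, cutoff=0.85):
    """Number of top PCs needed to reach `cutoff` of the total variance."""
    centered = M - M.mean(axis=0)
    sing = np.linalg.svd(centered, compute_uv=False)
    var_ratio = np.cumsum(sing**2) / np.sum(sing**2)
    # searchsorted gives the first index whose cumulative ratio reaches the cutoff.
    return int(min(np.searchsorted(var_ratio, cutoff) + 1, var_ratio.size))

def dimension_curve(M, sizes, cutoff=0.85, seed=0):
    """Sample nT traits (columns), count nD PC dimensions, and return dnT/dnD."""
    rng = np.random.default_rng(seed)
    n_dims = []
    for n_t in sizes:
        cols = rng.choice(M.shape[1], size=n_t, replace=False)
        n_dims.append(n_pc_dims(M[:, cols], cutoff))
    d_t = np.diff(np.asarray(sizes, dtype=float))
    d_d = np.diff(np.asarray(n_dims, dtype=float))
    grad = np.full(d_t.shape, np.inf)            # no new dimension gained -> infinite gradient
    np.divide(d_t, d_d, out=grad, where=d_d > 0)
    return n_dims, grad

# Toy stand-ins (assumptions): a 5-dimensional subspace versus independent noise.
rng = np.random.default_rng(1)
low = rng.standard_normal((1000, 5)) @ rng.standard_normal((5, 200))
low += 0.01 * rng.standard_normal(low.shape)
high = rng.standard_normal((1000, 200))
sizes = [20, 40, 80, 160]
low_dims, low_grad = dimension_curve(low, sizes)
high_dims, high_grad = dimension_curve(high, sizes)
print("low-dimensional space :", low_dims, low_grad)
print("high-dimensional space:", high_dims, high_grad)
```

In the low-dimensional toy space the count of PC dimensions stops growing as more traits are sampled, so the gradient becomes large, whereas in the noise-like space nearly every added trait brings a new dimension and the gradient stays small.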

Therefore, in both the yeast and human brain phenotype space the traits are coded by a rather small set of common dimensions that are determined by genotype and numerous trait-specific dimensions that are shaped by non-genetic factors.

Inspired by the evolutionary 'cost of complexity' theory, in this study we designed a dimension decomposition method for separating subspaces of distinct dimensionality. We applied the method to a yeast phenotype space and a human brain phenotype space, respectively, to separate the genetic subspace from the non-genetic subspace. The separation results were then validated by available benchmarks. Despite the success, we caution that the results are merely consistent with the evolutionary theory; resolving the debates on the theory [35,36], which is beyond the scope of this study, requires further work.

The goal of this study is to find how traits are coded in phenotype space. Our analyses

There are a few technical issues worth discussing. First, the UBHDD method depends on dense sampling of a phenotype space. A down-sampling strategy can be used to assess the sufficiency of trait sampling. We found the overall performance of UBHDD for the yeast traits is nearly saturated (Fig. S7a); however, the performance for the human brain traits is sensitive to down-sampling (Fig. S7b), suggesting the current sampling of the brain space is still insufficient.

Second, the uncorrelation thresholds (Ru) used in this study may not be ideal. In principle, a smaller Ru is always helpful for avoiding the effects of non-genetic dimensions, which, however, would leave too few traits for conducting UBHDD. We found a good assessment of the threshold

conceivable that, like phenotype space, many complex systems can be partitioned into a sub-system determined by intrinsic factors and another sub-system shaped by extrinsic factors, the former of which is of rather low dimensionality while the latter is composed of myriad dimensions.

Hence, UBHDD could be a generally useful tool for studying a complex system.

Yeast segregant population (seg-population)
We study a panel of segregants of a yeast cross (S. cerevisiae strain BY × strain RM) generated by

in the seg-population to make models obtained from the two populations comparable.

We collect the 870 brain MRI phenotypes and related covariates (

The random effect estimated for each segregant is defined as P_G. The R package 'lme4' is used. Standard error is estimated by Jackknife.

The R package 'rrBLUP' is used and standard error is estimated by Jackknife for yeast traits.

QTL mapping in yeast
In yeast we follow the pipeline used in a previous study [26]. The association between a focal trait and a focal SNP is calculated as LOD score defined by -n ln(1-R^2) / (2 ln(10))

Uncorrelation-based high-dimensional dependence (UBHDD) modelling
The model is formulated as

based on t-test with multiple-testing correction (p = 0.01/(n-1)), where n is the number of traits. In the del-population, the threshold is set the same as that of the seg-population to be comparable. In human brain, the threshold is set to 0.15 by referring to the threshold in the seg-population.
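Two of the quantities in these Methods passages can be written down directly; the sketch below is illustrative only. `lod_score` follows the LOD expression given for QTL mapping, with `n` taken to be the number of individuals, which is an assumption because the symbol is not defined in the surviving text. `uncorrelation_threshold` converts the corrected significance level p = 0.01/(n-1) into a critical absolute correlation; the two-sided test and the normal approximation to the t distribution are assumptions, not details confirmed by the source.

```python
import math
from statistics import NormalDist

def lod_score(n, r2):
    """LOD = -n * ln(1 - R^2) / (2 * ln 10) for a trait-SNP squared correlation r2."""
    return -n * math.log(1.0 - r2) / (2.0 * math.log(10.0))

def uncorrelation_threshold(n_individuals, n_traits, alpha=0.01):
    """Critical |r| for a correlation test at p = alpha / (n_traits - 1).

    Assumptions: two-sided test; the t quantile is approximated by the normal
    quantile, which is reasonable only for large n_individuals.
    """
    p = alpha / (n_traits - 1)
    z = NormalDist().inv_cdf(1.0 - p / 2.0)
    df = n_individuals - 2
    return z / math.sqrt(df + z * z)    # inverts t = r * sqrt(df / (1 - r^2))

print(lod_score(1000, 0.02))
print(uncorrelation_threshold(1000, 405))
```

Traits whose absolute marginal correlation with the focal trait falls below such a threshold would be treated as uncorrelated; the numbers of individuals and traits passed here are only round figures for illustration.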

To control for potential technical bias, we also conduct shuffling analysis. For a focal trait T_i in Eq. (9), we keep its uncorrelated traits T_j unchanged and shuffle T_i among individuals. Then, the same modelling process is conducted.
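A minimal sketch of this shuffling control is given below, assuming a plain least-squares linear model evaluated on held-out individuals. The function names `held_out_r2` and `shuffle_control`, the train/held-out split and the toy data are hypothetical; `X` stands for the matrix of traits already selected as uncorrelated with the focal trait `y`.

```python
import numpy as np

def held_out_r2(X, y, n_train):
    """Fit a linear model on the first n_train rows; return squared r on the rest."""
    Xtr, ytr = X[:n_train], y[:n_train]
    mu_x, mu_y = Xtr.mean(axis=0), ytr.mean()
    beta, *_ = np.linalg.lstsq(Xtr - mu_x, ytr - mu_y, rcond=None)
    pred = (X[n_train:] - mu_x) @ beta + mu_y
    return np.corrcoef(pred, y[n_train:])[0, 1] ** 2

def shuffle_control(X, y, n_train, seed=0):
    """Compare the real fit with a fit after permuting the focal trait among individuals."""
    rng = np.random.default_rng(seed)
    y_shuffled = rng.permutation(y)     # predictors stay untouched
    return held_out_r2(X, y, n_train), held_out_r2(X, y_shuffled, n_train)

# Toy data (assumption): 50 predictor traits and a focal trait sharing 5 latent dimensions.
rng = np.random.default_rng(1)
latent = rng.standard_normal((1000, 5))
X = latent @ rng.standard_normal((5, 50)) + rng.standard_normal((1000, 50))
y = latent.sum(axis=1) + rng.standard_normal(1000)
real, shuffled = shuffle_control(X, y, n_train=700)
print(f"variance explained: real = {real:.3f}, shuffled = {shuffled:.3f}")
```

If the modelling procedure itself introduced spurious signal, the shuffled fit would still explain a noticeable share of variance; a value near zero indicates that it does not.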

Comparison between UBHDD and PCA in simulated structured population
We first simulate a structured population (Supplementary Note III). Then, we apply UBHDD to the structured population and obtain the P_g. Next, we apply PCA to the structured population and

Code and data availability
The codes and the supporting data can be found at https://github.com/Jianguo-Wang/UBHDD.

Non-genetic variation of complex traits can result from human definitions
Assuming a complex trait (T) is independently contributed by genetic factors (G) and a random noise (E), we have

where µ ≠ 0 is the mean. Then, assuming the cube of T is also defined as a complex trait, we have

where new terms

We conduct simulation and obtain the broad-sense heritability (H^2) of T and T^3 by linear mixed model (the same as that in Methods), shown as

Therefore, human definitions can be a source of non-genetic variation.
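The argument of this note can be illustrated with a small simulation. The sketch below assumes the additive form T = µ + G + E with µ = 1 and standard normal G and E, two replicates per strain, and 20,000 strains; all of these values are hypothetical. It also swaps in a one-way ANOVA variance-component estimator for the linear mixed model, which is a simplification for a balanced design and not the procedure described in Methods.

```python
import numpy as np

def broad_sense_h2(y):
    """One-way ANOVA estimate of H^2 from a (n_strains, n_replicates) array."""
    n, r = y.shape
    strain_mean = y.mean(axis=1)
    ms_between = r * strain_mean.var(ddof=1)                       # between-strain mean square
    ms_within = ((y - strain_mean[:, None]) ** 2).sum() / (n * (r - 1))
    var_g = max((ms_between - ms_within) / r, 0.0)                 # genetic variance component
    return var_g / (var_g + ms_within)

rng = np.random.default_rng(0)
n_strains, n_reps, mu = 20000, 2, 1.0          # illustrative values (assumptions)
G = rng.standard_normal((n_strains, 1))        # one genetic value per strain
E = rng.standard_normal((n_strains, n_reps))   # independent noise per replicate
T = mu + G + E                                 # assumed additive trait
h2_T = broad_sense_h2(T)
h2_T3 = broad_sense_h2(T ** 3)                 # the cube defined as a new trait
print(f"H^2(T) = {h2_T:.3f}, H^2(T^3) = {h2_T3:.3f}")
```

Expanding the cube produces terms that mix G and E, and in this toy setting those terms are counted as non-genetic variance, so the estimated H^2 of T^3 is expected to fall below that of T.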

Supplementary Note II
The probability of two traits with the same dimensions
We achieve this estimation by transforming this problem into a geometric model of probability. We will first derive the general expression and then give the closed form in a specific condition. First, consider a general condition where two k-dimensional unit vectors α and β with angle equal to θ share i dimensions in an N-dimensional space (i.e., α and β each has k non-zero entries and N-k zero entries, sharing i non-zero entries), assuming α β α β θ α β β α β α β where ' α β α β = i i and 1 β = . Therefore, we obtain the geometric distribution of β as

Therefore, combining Eq. (5) and Eq. (8), we obtain the geometric estimation for certain α and β given k, i and θ as 1 , , cos , 2 1 ( ) ( ) , (1) ( 1 cos ),

Therefore, the probability of two k-dimensional traits sharing the same dimensions (ψ) in an N-dimensional space given marginal correlation (cosθ) is estimated as , , , where R = cosθ. It is obvious that Eq. (14) still satisfies Eqs. (12-13). Eq. (14) is used to generate the trajectories in Fig. 1d. From Eqs. (12-13) we can obtain three corollaries: Corollary 2:

We conduct PCA in or and define the top PCs with 85% of variance explained as PC dimensions. Different cutoffs (75%, 85%, 95% and 100%) are compared. The error bars shown in lines represent the 95% quantile of 100 sampling repeats. Middle lines represent the mean value. Notably, the number of PC dimensions in is always underestimated because PCA tends to merge independent dimensions in a population of small sample size, especially when the dimensionality of the subspace is larger than the rank of the matrix. In contrast, the PC dimensionality of well approximates the actual dimensionality of the subspace at the 85% cutoff. When larger cutoffs (95%, 100%) are chosen, the PC dimensions of subspace will be overestimated. The overestimation happens because weak noise of modelling is falsely taken as dimensions. To facilitate comparison, the actual dimensionality in or subspaces are plotted the same as in Fig. 1f. The seemingly aberrant error bar at the 100% cutoff is also contributed by the PCA method (the R function princomp returns a different number of PCs with 100% variance explained when traits are reordered).

The accurate separation of genetic and non-genetic subspaces depends not only on a rational uncorrelation threshold but also on enough trait sampling. We conduct the same learning process for different proportions of trait subsets from 10% to 100%, i.e., a down-sampling strategy. Then, the distribution of UBHDD performance (R^2, the variance of genetic component estimated by UBHDD) is compared among these trait subsets. (a) shows the distributions of yeast. (b) shows the distributions of human brain.

To judge the uncorrelation threshold ( ), we provide a statistical test as follows. First, we calculate the square of marginal correlation ( ) between the focal trait and each of its uncorrelated traits. Then, we calculate the square of coefficients ( ) for each uncorrelated trait in the learned linear function, say, . An optimal threshold is determined if the between and is insignificant, meanwhile, taking the number of uncorrelated traits available into account.