Comparative Gene Analysis in Chordate Embryonic Development Reveals Co-existence of Von Baer’s and Haeckel’s Postulates

21 The relationship between embryonic development and evolution historically 22 investigated based on embryo morphology could now be reassessed using mRNA 23 expression endophenotype. Here, we analyzed the conservation of divergence of the 24 developmental mRNA expression profiles in nine evolutionarily distinct species, from 25 oyster to mouse, and compared them to the original concepts formulated by von Baer 26 and Haeckel. We find nearly linear conservation of species’ developmental programs 27 among these species, compatible with models rooted on von Baer’s postulates, for 28 approximately a third of expressed orthologous genes. By contrast, 5-15% of 29 developmental expression profiles, enriched in neural genes, displayed an alignment 30 pattern compatible with the terminal edition paradigm proposed by Haeckel. Thus, the 31 development-evolution relationship based on mRNA expression agrees with early 32 concepts based on embryo morphology and demonstrates that the corresponding 33 patterns coexist in chordate development.


Abstract 21
The relationship between embryonic development and evolution historically 22 investigated based on embryo morphology could now be reassessed using mRNA including patterns involving partially similarity, development shifts, and inversions. 130

Two alternative evolutionary models of species' development 131
embryonic stages of species with more recently evolved organization, "replaying" the 142 evolutionary history of this body plan ( Fig. 3a; Supplementary Fig. 6

Identification of stage-associated genes 261
We defined genes preferentially expressed at a particular developmental stage as 262 stage-associated genes in each species following the method described in 8, 21 . For each 263 species, we required FPKM (fragments per kilobase of exon model per million reads 264 mapped) > 2 at a particular stage and Z-score (normalized FPKM across samples) 265 representing the difference between this stage and the rest of stages > 1.5. On average, 65%-90% of genes expressed in development were identified as stage-associated 267

Alignment of mouse developmental stage sets to developmental profiles of other species 269
To assess the transcriptome similarity in the course of developmental stages between 270 mouse (Mm) and other species, we used stage-associated and 1:1 orthologous gene to 271 align the developmental stages between mouse and other species followed by the 272 method reported in 8, 21 . The detailed description can be found in the supplementary 273 materials of 8, 21 . In brief, we calculated the pairing score using the hypergeometric test 274 to evaluate the ratio of orthologs overlapping within the stage-associated genes 275 pairwise. The pairing score can be written as: 276 Paring score = -log10 (Bonferroni corrected P-values) 277 The paring score was used to quantify the significance of the overlap on each pairwise 278 stage comparison between the mouse and another species. From the paring score, we 279 identified the relationship between mouse and other species (Fig. 1e). To check the 280 stability of this relationship, we repeated the comparison 500 times with randomly 281 assigned stage-associated genes for each stage of each species using the same 282 procedure.
According to the above scheme, we further assigned the corresponding stage 285 alignment between mouse and other species using the Needleman-Wunsch algorithm 286 with the gap penalty equal to one. To estimate the stability of alignment based on the 287 mean of the replicates, we randomly chose one individual per stage, aligned two 288 species 500 times, and calculated the frequency for each pair of alignment. The 289 resulting frequency was presented by the thickness of the line in Fig. 2a. 290

Identification of CM/PM genes 291
To identify genes with expression trajectories compatible with the continuous model 292 (CM) or progressive model (PM) predictions, we considered development-related 293 genes as mentioned in "Identification of development-related genes" with existing 294 orthologs between mouse and any other species. We then linearly aligned the 295 complete set of developmental stages in a given species (oyster, amphioxus, ciona, 296 zebrafish, frogs, turtle, and chicken) to the increasingly truncated sets of mouse 297 developmental stages, i.e., from first two to all 17 stages (Supplementary Fig. 6). To 298 this end, we interpolated 20 time points uniformly distributed across the whole 299 ontogeny of non-mouse species using cubic smoothing spline (with three degrees of 300  Furthermore, we repeated the classification process by shuffling the orthologous 320 relationships between non-mouse species and the mouse (Supplementary Fig. 7).
Specifically, we expected that by shuffling, each developmental stage of the mouse 322 would be equally represented in other species, allowing us to calculate the random 323 expectation of alignment occurrences for each stage. Thus, the final number of genes 324 aligning to each set of developmental stages of the mouse was obtained after 325 subtracting these background contaminations. Genes adhering to the continuous 326 model (CM) were defined as those best aligned between mouse and other species over 327 the complete developmental intervals. By contrast, genes following the progressive 328 model (PM) were defined by the following criteria: i) aligned best to a truncated set of 329 mouse developmental stages; ii) displayed a significantly greater number of aligning 330 genes compared to the background distributions, determined using a BH-corrected 331 p-value of less than 0.05 as a cutoff; and iii) an increase in the number of genes 332 aligning to particular sets of mouse stages compared to its adjacent sets (Fig. 3a, 333

Supplementary Data 5 and Supplementary Data 6). 334
To further check the robustness of the alignment results, we restricted our 335 analysis to evolutionarily old genes. To do so, we repeated the alignment procedure 336 using a subset of the development-related genes with inferred Earliest Ortholog Level 337 (EOL) < level 10. This conservation level corresponds to genes that appeared and 338 became fixed within the genomes before the separation of Chordata 31 . 339

Overlap of CM/PM genes between species 340
To calculate the significance of the overlap of CM or PM genes between species, we 341 sampled the same number of CM or PM genes in each species from all development-related genes 1,000 times and recalculated the number of overlapping 343 genes to obtain the empirical distribution. The overlap significance p-value was 344 calculated as the proportion of the values, which were as larger or greater than the 345 actual overlapping gene number. Given all pairs of species involved, a 346 Bonferroni-corrected p-value of less than 0.05 was used as the cutoff of the overlap 347 significance (Supplementary Data 7). 348

Amino acid conservation of CM/PM gene 349
We compared the evolutionary conservation between proteins encoded by CM and 350 PM genes using the Ka/Ks metric 32 . Since in this study, we are not studying Ka/Ks 351 variations across the evolutionary branches -we focused on the comparisons of PM 352 versus CM genes within each individual species, we conducted this analysis mainly in 353 a pairwise manner between a given species and mouse. Firstly, we extracted the 354 coding sequences of each gene based on the corresponding annotation information 355 and then translated the sequences into amino acid sequences using the function 356 "translate" implemented in the Bioconductor package "Biotrings". The longest protein 357 sequence was selected as the gene protein sequence. Next, we performed protein 358 sequence alignments between non-mouse species and the mouse using the function 359 "pairwiseAlignment" in "Biostrings", with the scoring matrix set as BLOSUM50. The 360 resulting protein alignment was translated to the nucleotide level and used as the input 361 for PAML 33 for a Ka/Ks analysis output. This procedure was conducted in a pairwise 362 manner for the mouse and a given non-mouse species to match the corresponding definitions of PM and CM genes carried out in a pairwise manner ( Supplementary Fig.  364 12). To test the robustness of our Ka/Ks calculations in the context of multiple species, 365 we performed multiple sequence alignment using Clustal Omega 34 for all protein 366 sequences mentioned above convert to nucleotide levels. We next utilized the 367 CODEML application in PAML to calculate the Ka/Ks for given cross-species 368 ortholog. The significance of the difference between PM and CM genes in each 369 species was assessed using one-sided Wilcoxon rank-sum test ( Supplementary Fig.  370 13). 371

Gene expression interpretation to mouse developmental stage 372
To compare the similarity of the expression profiles across developmental stages of 373 nine species, we used the predicted developmental stage alignment presented in Fig. 2a  374 to create a unified alignment of eight species to the mouse developmental curves. To do 375 so, we mapped 33 stages cumulatively interpolated from eight species to the full mouse 376 developmental curve fitted using cubic smoothing spline with ten degrees of freedom. 377 We then compared gene expression curves among nine species based on z-transformed 378 expression of each gene interpolated at these 33 stage points. 379

Clustering of gene expression profiles in six vertebrate and nine chordate species 380
To investigate the expression pattern diversity in nine or six species, we used 381 on "molecular function", "biological process", and "cellular component" was applied 397 independently. GO terms with BH corrected p < 0.05 were reported. We applied no 398 multiple test correction to chordate' gene cluster enrichment analysis due to low 399 statistical power of the dataset and used a more stringent nominal p-value cutoff of p 400 < 0.001. 401 For the pathway enrichment test, the reference genes were the same as for 402 GO-based analysis. We used hypergeometric test (phyper in R) to assess the 403 enrichment in each KEGG pathway. Bonferroni correction was applied for genes in 404 vertebrate clusters. Pathways with corrected p < 0.05 were reported as significantly 405 enriched. No correction was applied to genes from chordate clusters due to low 406 statistical power, and the nominal pathway enrichment p-value was set to p < 0.05. This 407 relaxed cutoff was used to assess the potential overlap between enriched functions 408 found in vertebrate and chordate clusters (Supplementary Data 4). 409

Statistical analysis and software 415
Analyses were conducted in the R environment (http://www.r-project.org/). To 416 minimize the type I error rate, we used multiple test correction for p-value calculations, 417 unless specifically indicated otherwise. The statistically significant level used for each 418 was specified in the main text. Additionally, we used Perl, Python, and R packages, 419 including "topGO", "reshape", "RColorBrewer", "ggplot2", "Biotrings", as well as 420 shell scripts, in the analyses. 421

Data availability statement 422
The RNA-seq data of Crassostrea gigas were from GEO database (accession number