Sequence data collection
Nucleotide sequences of 682 vertebrate APP family sequences were collected using megablast or BLASTn in May of 2018 [42]. Human APP family was set as query sequences (APP: NM_000484, APLP1: NM_001024807, APLP2: NM_001642). Clustal W implemented in MEGA7 was used for generating protein MSA files using default parameters [43, 44]. After manually removing overlapping species and non-aligned/partial sequences, a total of 590 sequences (APP: 212, APLP1: 177, APLP2: 201) remained for further analysis.
A 1:1 orthologous sequence of human APLP1 and APLP2 was chosen using the same procedure as described earlier, verified through the OMA browser for a list of gene-level orthologues [45, 46]. In contrast, fish APP linage seemed to independently experience duplication events resulting in duplicates named APPa and APPb [9]. As duplication may accelerate the heterogeneity of dN/dS value [47], we compared dN/dS of APPa/b and concluded that their insignificant dN/dS values would not affect the results of the dN/dS analysis. The APPa/b sequences were collected as mentioned above using zebrafish APPa/b (APPa: NM_131564, APPb: NM_152886) as a query (total of 29 species each).
Phylogenetic Tree Reconstruction
Phylogenetic trees were constructed using the neighbor-joining method implemented in MEGA7 [19, 38]. The evolutionary distance of the nucleotide sequence was calculated based on the maximum composite likelihood method [48]. Dayhoff matrix was used to calculate the evolutionary distance of the amino acid sequence [49]. Sites with gaps were not included in the reconstruction. To confirm the assurance of branching events, bootstrap value (resampling = 100 times) was inferred, and branches retaining bootstrap value > 50 are considered in the analysis [50].
Positive Selection Estimation Using Dn/ds Calculation
A pairwise dN/dS calculation was performed in MEGA7 by Li-Wu-Luo method with 100 bootstrap replications for each APP family divided into either clade level (fish, fish + amphibian, fish + amphibian + reptile + bird, and fish + amphibian + reptile + bird + mammal) or domain level (HBD, CuBD, E2 domain) [51]. Sites with gaps were excluded from the calculation. Bootstrap resampling was conducted, as shown above.
Maximum likelihood dN/dS calculation under branch-site model (Model A vs. Model A null) implemented in PAML was performed as described earlier, using EasyCodeML that contains GUI-based format conversion and CodeML in one platform [52, 53]. Here, MSA and phylogenetic tree data were prepared separately for each vertebrate APP family using a similar procedure as described above with three inputs. This method differs from pairwise dN/dS calculation due to its direct estimation of positive selection on a branch of interest (foreground branch) with statistical testing.
Clustering Method Using Digital Annealer
Protein sequences of the APP family paralogs were clustered using the Fujitsu’s Digital Annealer (DA), a hardware specialized for solving optimization problems [54]. Briefly, given a sequence distance matrix and cost function, DA searches for optimal bipartition of the dataset to maximize the distance between the two clusters. The distance matrix (termed the similarity graph) was first described by the evolutionary distance using pairwise sequence alignment (PSA). PSA and evolutionary distance (here, Poisson-corrected distance) calculation was performed using the Data Analysis in Molecular Biology and Evolution (DAMBE) software [25]. Thereafter, we assigned each sequence into an appropriate cluster and maximized the total distance between the two clusters; cost function was used to describe the total distance. As DA accepts binomial values for the variables, the total distance between two clusters containing n number of sequences was calculated as follows:

Principle Component Analysis (pca) And Dataset Preparation
Principle component analysis (PCA) was conducted for the vertebrate APP family sequence based on relative synonymous codon usage (RSCU), amino acid composition, and amino acid composition grouped by attributes. RSCU and GC-content were calculated in MEGA7 [19, 38]. Seven attributes, hydrophobicity, normalized Vander Waals volume, polarity, polarizability, charge, secondary structure, and solvent accessibility and their composition for each sequence, were calculated using the ProtrWeb server [55]. Then, PCA using prcomp function in stats supplied as a default package in R was applied for the dataset. The results for ProtrWeb and PCA, including factor loadings, are shown in the supplementary table 2.
Tertiary Structure Modeling And Docking Simulation On Heparin
Tertiary structures of HBD of human APP family (APP: NM_000484, APLP1: NM_001024807, APLP2: NM_001642, reptile APLP1: XM_030212233.1, amphibian APLP1: XM_026702387, and fish APLP1: XM_012862627) were modeled based on homology modeling as implemented in Phyre2 server [56]. Briefly, each HBD of the APP family was truncated manually and subsequently submitted to the webserver. Alanine substitution model for human APLP1 was manually created by changing to R80A, R82A, R83A, and R86A. The default mode was used for tertiary structure prediction. Protein models with the confidence of 100% and percent identity > 40% were chosen for further prediction, which are preferred index values for reliable models. The template crystal structure for homology modeling was HBD of APP (PDB: chain A of 1MWP). Detailed inputs and output scores for modeling are presented in supplementary table 3.
Subsequently, the predicted models were transferred to the heparin docking simulation implemented in ClusPro 2.0 server [57]. Briefly, the ligand (heparin) first determines 70000 rotation poses and parallelly translates around fixed HBD protein structure by 1.0 Å 3D grid. Then, for each grid point, the pose with the lowest interaction energy (scored by van der Waals attractive, van der Waals repulsive, and electrostatic energy) is selected. Finally, among the selected poses, neighboring poses defined as RMSD within 9 Å are clustered to determine the most preferred binding mode chosen by its clustering size. Default parameters were used for heparin docking simulation.
The surface electrostatics of protein models were calculated using the PDB2PQR server using a pH of 7.4 [58]. The visualization, including models and surface electrostatics, was conducted in PyMOL.
Coevolutionary Analysis Using Evolutionary Rate Covariation And Coexpression
The coevolutionary analysis was undertaken using evolutionary rate covariation (ERC) value. ERC value is the degree of similarity in a pair of proteins based on phylogenetic tree similarities, here predicted in webserver [33]. The dataset is generated from all available mammalian protein sequences in the UCSC server. Briefly, the correlation coefficient derived from covariation of evolutionary distance between corresponding branches of pair proteins is calculated. Notably, this server normalizes the evolutionary distance by species tree before the calculation of the correlation coefficient.
Statistical analysis
The mean dN/dS values for each group of sequences (either grouped by species: fish, fish + amphibian, fish + amphibian + reptile + bird, and fish + amphibian + reptile + bird + mammal or domain level: HBD, CuBD, E2 domain) were compared by multiple comparison using one-way ANOVA with posthoc two-tailed Tukey HSD Test (P-value: * < 0.10, ** < 0.05, *** < 0.01).
The fitness of model A and model A null were compared by likelihood ratio test (LRT) assuming test statistic 2Δℓ following χ2 distribution: positive selection model or Model A (allowing dN/dS or ω > 1 for foreground branches) and purifying selection model or Model A null (not allowing dN/dS or ω = 1 for foreground branches). Subsequently, Bayes Empirical Bayes (BEB) analysis were applied for sites in foreground branch with ω > 1 to test whether its value significantly have ω > 1 (positive selection) judged by posterior probability (Posterior probability: * > 0.90, ** > 0.95, *** > 0.99).
The mean ERC values were compared with the neuronal system’s ERC value using one-tailed Dunnett’s test (P-value: * < 0.10, ** < 0.05, *** < 0.01).