Preparing data for case studies
To demonstrate the application of the proposed pipeline, validate its efficiency inassessing PS and CR, and illustrate its functionalities and capabilities, two case studies were conducted on rice varieties and Malawi cichlids. To prepare the data for these case studies, we first downloaded the sequencing data with associated metadata from the NCBI and EBI databases and then performed sequence alignment and variant calling using the procedure described in Supplementary Note 2. To create the dataset of rice varieties, we selected BioSample accessions of cultivars, landraces, and wild species, registered in the Rice Annotation Project Database [35] (RAP-DB, http://rapdb.dna.affrc.go.jp/) (Sakai et al. [36]), with an average depth of sequencing coverage greater than 30. To create the dataset of Malawi cichlids, we selected BioSample accessions from BioProject PRJEB1254 and PRJEB15289, for which the sampling locations were recorded in the NCBI BioSample database. All raw sequencing reads were obtained from a previous study (Malinsky et al.) [37]. We compared the data obtained by running our pipeline with the data published in that article, which is discussed in the following subsections. Details of selecting BioSample accessions and reference genomes [38, 39], downloading nucleotide sequence data, and preparing genetic variant data are described in Supplementary Note 3. Accessions from the BioProject, BioSample, and European Nucleotide Archive databases are listed in Supplementary Tables 2 and 3.
Results obtained in case studies
The results of the analyses performed five times using different filtering and pruning options (Table 2) are described here.
The four tabs on the main menu, corresponding to the types of analysis performed in our pipeline, are shown at the top of Fig. 5 (indicated by ①), of which the ‘Population Stratification analysis’ tab was selected (Fig. 5 ①ʹ). The parameter values used in the PS analysis are listed in Table 2 (Run A).
For this analysis, we prepared three methods: PCA, normalized PCs (each eigenvector is multiplied by the square root of its eigenvalue), and MDS (Fig. 5 ②). In the example shown in this figure, PCA is selected (Fig. 5 ②ʹ). In PLINK 2.0, by default, the top 10 PCs are extracted from the variance-standardized relationship matrix, and all of these components are used in the visualization stage of our pipeline. The scatter plot in Fig. 5 is an interactive 2-component PCA plot in which the first principal component (PC1) is represented by the horizontal axis and explains 10.4% of the variance, and the second principal component (PC2) is represented by the vertical axis and explains 5.2% of the variance. See Supplementary Note 4 for details on calculating the percentage of variance explained by each PC. The 2-component PCA plot for the other PCs can be drawn by selecting the corresponding components from the two drop-down lists, as shown in Fig. 5 ③. Users can highlight one of the samples on the PCA plot by selecting it from the drop-down list (Fig. 5 ④), and the selected sample will be shown to be larger than the others (Supplementary Fig. S1a). The interactive scatter plot can display annotation information, such as sample ID, PC values, and the group or cluster number to which the sample belongs, by hovering the mouse pointer over the markers (samples) in the scatter plot (Fig. S1a, Run A in Table 2). Users can “Hide” or “Display” the IDs or names of the samples by changing the checked value in the radio buttons (Fig. 5 ⑤, Fig. S1b). In this figure, the ‘Hide’ value was checked (Fig. 5 ⑤ʹ), and accordingly the names of the samples were hidden. Additionally, users can zoom in and out of the plot using Plotly's zoom functionality (Fig. S1b, Run A in Table 2).
In our pipeline, a PCA plot is a scatter plot that maps marker colors to a categorical variable (user-defined groups or clusters calculated using PLINK). Figure 5 ⑥ shows the color legend that matches the groups defined by us with the corresponding marker colors. We grouped 143 rice varieties into 16 groups based on the rice types (e.g., indica, aus, temperate japonica, tropical japonica, and aromatic) (Supplementary Table 4), similar to the way they were grouped in the RAP-DB. As mentioned earlier, after filtering by maximum missing genotype rates per-sample (--mind with a value of 0.2), the number of samples decreased to 141 and the number of groups decreased to 14 (“MER: Oryza meridionalis” and “PUN: Oryza punctata" were excluded) (Fig. 5). The japonica varieties (JP: Oryza sativa Japonica Group, TEJ: Oryza sativa temperate japonica subgroup, TRJ: Oryza sativa tropical japonica subgroup) and indica varieties (IND: Oryza sativa Indica Group, AUS: Oryza sativa aus subgroup) were separated from each other along PC1, whereas the TRJ and TEJ groups as well as the TRJ group and indica varieties were separated along PC2 (Fig. 5). The same can be observed in the MDS plot (Supplementary Figs. S2 and S3, Run A in Table 2). Plotly provides functions to show/hide data from each group individually by clicking on corresponding legend items. We used this feature on all charts and plots in our pipeline.
For a more complete overview of the results of the PCA and MDS analyses, along with the projections of the samples on the plane defined by the first two PCs, plots of PC1 and PC3 are often used. Supplementary Fig. S4 shows plots of PC1 and PC3 obtained from the same analysis, as illustrated in Fig. 5 (Run A in Table 2). PC3 explains 3.2% of the variance. Samples from groups such as BAR (Oryza barthii), GLA (Oryza glaberrima), and GLU (Oryza glumaepatula) as well as some samples from the RUF (Oryza rufipogon) group, which are indicated in the figure by an oval, were separated along PC3 from that of other groups, including the Oryza sativa groups mentioned above (Fig. S4).
To analyze PS and CR in japonica and indica varieties used in our case study, we selected samples belonging only to five groups (JP, TEJ, TRJ, IND, and AUS) and performed the analysis stage. The resulting PCA plots are shown in Fig. 6a (Run C in Table 2) and Fig. 6b (Run E in Table 2).
Figure 6a shows a similar pattern, as shown in Fig. 5. The TEJ group (Fig. 6a ①) and indica varieties, which included the IND and AUS groups (Fig. 6a ③), were separated from each other by PC1, while the TRJ group (Fig. 6a ②) was separated from the others by the PC2. Figure 6b shows the PCA plot for the same data (Runs E in Table 2); however, the marker colors in Fig. 6b indicate the clusters calculated using PLINK. The clustering function or user-specified groups can be set as a parameter in the configuration file. In addition, users can specify the number of clusters that are biologically interesting or easily interpretable. As shown in Fig. 6b, the TEJ and IND groups are divided into two separate clusters along the first and second PCs, respectively, which coincides with the PCA analysis.
As with other types of analysis, on the ‘Population Stratification analysis’ tab, we prepared a button to download analysis result files (Fig. 5 ⑦), which were either PLINK output files such as .eigenvec (PCs), .eigenval (eigenvalues), and .mds or generated by our pipeline as a normalized_plink_pca.txt file, which contained normalized PCs calculated using PLINK output files.
As described above, during the analysis stage of our pipeline, the PLINK --fst command is executed, and the results are visualized in the ‘Wright's FST estimates’ tab (Fig. 7 ①ʹ, Supplementary Fig. S5) of the tabs panel (Fig. 7 ①). The parameter values used in FST analysis are listed in Table 2 (Run C).
Users can select one of the pairs of subpopulations (groups or clusters of samples) by choosing the pair from the drop-down list (Fig. 7 ② and Fig. S5 ①), and Wright's FST value between the pairs of selected subpopulations (pairwise FST) is displayed in the text box immediately below this drop-down list. In this example, this value was 0.235 for the IND and TEJ groups.
The PLINK --fst command with the 'report-variants' modifier calculates the per-variant FST estimates, which isused in our pipeline if the number of groups/clusters is ≤ 5 (to control the output size). The FST values for each variant between pairs of the selected subpopulations are shown in the Manhattan plot (Fig. 7 and Fig. S5). Variants with 'nan' Fst values were removed from the FST plot, and negative FST values were set to zero. The plot does not load in the browser if large number of genetic variants are included in the analysis (after filtering and pruning). Therefore, we plot chromosomes/contigs one at a time or the entire genome region only if the number of variants is ≥ 100 and ≤ 100,000. Users can switch these views by changing the corresponding values from the drop-down list (Figs. 7 ③ and S5 ②). According to the selected values of this list, Fig. 7 shows the distribution of FST on the Manhattan plot for all chromosomes, while Fig. S5 shows the Manhattan plot for only chromosome 9. To reduce the loading time of the Manhattan plot containing a large amount of data, we added a drop-down list with a range of FST values (0–0.9) (Figs. 7 ④ and S5 ③). Accordingly, only those variants with FST values ≥ 0.1 were displayed in the plot shown in Fig. S5. For a particular variant, an FST value near 1 indicated that each of the two populations was fixed for a different allele at that locus, similar to the variant shown in Fig. S5 ④. The legend colors in the plot (Fig. 7 ⑤) correspond to chromosome numbers. The download button labeled ‘Save original data for a selected pair of subpopulations as a zip file’ (Fig. 7 ⑥) allows the user to download files obtained with the PLINK --fst command and containing FST estimates between the two selected subpopulations. It can be a single file, that is,. fst.summary (all-population-pairs Wright's FST report) or two files, .fst.summary and .fst.var (per-variant Wright's FST report for one population pair), depending on the number of groups/clusters (per-variant FST estimates are calculated if the number of groups/clusters is < 5). The user can find the original files in the ‘data’ subdirectory of the Shiny app directory.
To illustrate how the FST values depend on the variants used in the analysis, we ran our pipeline on the same input subset (Run C in Table 2) without LD-based pruning and with the --maf parameter of 0.01 (Run D in Table 2). For comparison, we placed the HUDSON_FST values (between-population FST estimates) obtained from the two runs in Table 3, which shows that the FST values are significantly lower in the LD pruned data, and this result is consistent with those presented in the scientific literature, which is discussed in the Discussion section.
Table 3
Pairwise Hudson's FST between groups in the dataset of genetic variants of rice varieties
POP1 | POP2 | HUDSON_FST(a) | HUDSON_FST (LD pruned variants setb) |
AUS | IND | 0.34 | 0.094 |
AUS | JP | 0.569 | 0.202 |
AUS | TEJ | 0.72 | 0.29 |
AUS | TRJ | 0.615 | 0.202 |
IND | JP | 0.46 | 0.161 |
IND | TEJ | 0.605 | 0.235 |
IND | TRJ | 0.486 | 0.15 |
JP | TEJ | 0.124 | 0.053 |
JP | TRJ | 0.259 | 0.149 |
TEJ | TRJ | 0.412 | 0.203 |
Note: a Data were not pruned for LD and the --maf value was set to 0.01 (Run D in Table 2); b The parameters used are listed in Table 2 (Run C). |
Fig. 8 shows an example of the genetic similarity between individuals (samples) and the genetic relatedness between them (Run A in Table 2).
Users can display the results of these analyses by selecting the tab ‘IBS and GRM calculation & Kinship Coefficients estimation’ (Fig. 8 ①ʹ) on the main menu (Fig. 8 ①). For this type of analysis, we prepared three methods (Fig. 8 ②): IBS matrix calculation (Fig. 8 ②ʹ), GRM, and KING-robust kinship estimation. The results of these three types of calculations are displayed on interactive heatmaps, where samples can be ordered in two ways, ‘PLINK Sample ID’ and ‘Group/Cluster number’ (Fig. 8 ③). The list of sample IDs/Names on the heatmap can be in the same order as in the matrix derived from the corresponding PLINK command, or samples can be reordered according to the groups/clusters to which they are assigned (Fig. 8 ③ʹ). The gradient color bar in the heatmap (Fig. 8 ④) maps the colors to their corresponding values. The individual values for the two samples and the IDs/Names for these samples are displayed when the mouse is hovered over the colored square (Fig. 8 ⑤). The result files from these analyses can be downloaded by clicking the ‘Save data as a zip file’ button (Fig. 8 ⑥). The files are as follows: .mibs (identity-by-state matrix), .rel (relationship matrix), .king (KING-robust kinship coefficient matrix), and corresponding .id (Sample ID list) files.
Supplementary Fig. S6 shows a heatmap of the GRM (Fig. S6 ①) for the same data, as shown in Fig. 8 (Run A in Table 2). The samples are also ordered by ‘Group/Cluster number’ (Fig. S6 ②).
Scientific literature suggests that pruning data based on LD values is an important step for IBS and GRM calculations (see Discussion section). To illustrate how the values of IBS and GRM depend on the variants used in the analysis, we ran our pipeline on the same input subset (Run A in Table 2) without using LD-based pruning and with the --maf parameter of 0.01 (Run B in Table 2). The resulting IBS matrix (Supplementary Fig. S7 ①), reordered by the ‘Group/Cluster number’ (Fig. S7 ②), is shown in Fig. S7. As can be observed from the two heatmaps (Fig. 8 and Fig. S7), without LD-based pruning, the overall values of the IBS matrix are higher, as in the example of the same pair of samples (Fig. 8 ⑤ and Fig. S7 ③).
Unlike the calculation of IBS and GRM, LD-based pruning is not recommended for estimating KING-robust kinship coefficients (see Discussion section). Supplementary Fig. S8 shows an example of a heatmap of KING-robust kinship coefficients (Fig. S8 ①) for data that were not pruned for LD, and the --maf value was set to 0.01 (Run B in Table 2). The samples on the heatmap were ordered by ‘Group/Cluster number’ (Fig. S8 ②). Note that the KING kinship coefficients are scaled so that duplicate samples have a kinship of 0.5, rather than 1 (Fig. S8 ③). In this heatmap, most of the individual pairs had a kinship coefficient of 0, and only a few pairs had a kinship coefficient > 0.25, such as a pair of accessions of the Koshihikari variety with a KING kinship coefficient of 0.358 (Fig. S8 ④). As described above, to explore CR between individuals, we prepared GRM and the kinship matrix with pairwise KING kinship coefficients, so that users can choose between them depending on the objectives of the study and the type of downstream analysis.
In this subsection, we present the results of the analysis of the dataset containing the genetic variants of Malawian cichlids. These results can be viewed in the following tables: ‘Basic statistics,’ ‘Population Stratification analysis,’ and ‘Wright's FST estimation.’ We ran our pipeline multiple times using different filtering and pruning options (Table 2 Dataset of genetic variants of Malawi cichlids). The parameters used in Runs F and G differed in the number of groups, whereas the parameters used in Run H differed from those used in Run I by using LD-based pruning and values of the --maf parameter.
In the stacked bar chart of a ‘Sample variant-count report’ for the original dataset (Run F in Table 2), most of the observed variants belonged to the class of ‘Hom-REF genotype’ (homozygous reference allele; reference: M_zebra_UMD2a) (blue color), and only a small number of observed variants belong to other classes, such as ‘Hom-ALT SNP’ (orange), ‘Het. SNP genotype’ (green), and ‘diploid non-SNP variant’ (red) (Fig. S9). This result is consistent with that observed in a previous study showing that the genetic diversity in cichlid fish species is low [37]. Ten samples had more variants other than the ‘Hom-REF genotype’ class than the other samples. All were samples of the outgroup Astatotilapia species.
In the “Basic statistics” tab, the inbreeding coefficient (F) of each sample estimated based on the expected and observed individual heterozygosity can be shown in the grouped bar chart and line plot (Fig. S10, Run F in Table 2). These multiple subplots can be displayed by selecting the “Method-of-moments F coefficient estimates” report for the “After filtering” dataset on the “Basic statistics” tab. For most samples, the observed number of heterozygous genotypes was significantly lower than the expected number of heterozygous genotypes, and these values were approximately equal in few samples (Fig. S10). Conversely, the observed number of homozygous genotypes was higher than expected in most samples. The low expected heterozygosity found in this analysis indicates high homozygosity and low genetic diversity in Malawi cichlids [37].
Given the information on sampling locations, we divided the samples into 17 groups according to their geographic locations (see Supplementary Table 5 for details). We also grouped the samples into seven eco-morphological groups in the same way as described in the article (Malinsky et al. [37]) and in an additional outgroup “Astatotilapia” (see Supplementary Table 5 for details). By applying these two sets of groups, we analyzed the same input dataset using the same filtering and pruning parameters. The PCA plots for PC1 and PC2 obtained from these runs are shown in Supplementary Fig. S11 (Run F in Table 2, 17 groups) and Supplementary Fig. S12 (Run G in Table 2, eight groups). The groups separated from each other by the first and second PCs correlated well with the eco-morphological groups indicated by colors (Fig. S12). In contrast, positions in the PCA plot and sampling locations were not correlated, possibly because the species used in the case study belonged to different genera and were genetically diverged despite living in the same region (Fig. S11). However, when examining individuals only from the species Astatotilapia calliptera of the genus Astatotilapia, the PCA plot showed some association between genetic similarity among these individuals and sampling locations (Supplementary Fig. S13). The PCA plot shown in Fig. S13 is an enlarged view of the region shown in Fig. S11, indicated by an oval in the upper-right corner. All samples from this region belonged to the Astatotilapia calliptera species, as shown in Fig. S12 (indicated by ①).
It is interesting to note that individuals from the A. calliptera species group from the Lake Malawi catchment (green) are closer to individuals from the mbuna group (red) compared to those from the “outgroup Astatotilapia” (orange) that belonged to the same species of A. calliptera but were sampled from outside Lake Malawi (Fig. S12). This is in agreement with the observations of Malinsky et al. [37].
We ran PSReliP on the samples without “outgroup Astatotilapia” (Fig. S12 ②) and compared the results of PCA with those of Malinsky et al. [37] (Fig. 9, Run H in Table 2).
The results were in agreement in terms of the distribution of groups relative to each other and to both axes of the PCs, the values of eigenvectors, and the percentage of variance explained by each component. For example, PC1 explained 9.7% of the variance in our case (7.9% in Malinsky et al. [37]), whereas PC2 explained 4.2% of the variance in both cases. We also created pairwise PCA plots of the top 3–10 PCs for 109 accessions of Malawi cichlids (Supplementary Fig. S14 a-d) and found that our results were similar to those reported by Malinsky et al. [37].
A comparison of the FST values obtained from the two runs (Runs H and I in Table 2) with those presented by Malinsky et al. [37] showed that the FST values between the A. calliptera group and other groups shown in Malinsky et al. [37] were between the values obtained from the two runs (Table 4). Regarding the FST values between other groups, the values presented by Malinsky et al. [37] were slightly lower than the values we obtained for the data pruned for LD (Run H in Table 2).
Table 4
Pairwise Hudson's FST between groups in the dataset of genetic variants of Malawi cichlids
POP1 | POP2 | HUDSON_FST(a) | HUDSON_FST (LD pruned variants setb) |
Astatotilapia_calliptera | deep_benthic | 0.319 | 0.216 |
Astatotilapia_calliptera | Diplotaxodon | 0.416 | 0.338 |
Astatotilapia_calliptera | mbuna | 0.278 | 0.217 |
Astatotilapia_calliptera | Rhamphochromis | 0.507 | 0.451 |
Astatotilapia_calliptera | shallow_benthic | 0.308 | 0.197 |
Astatotilapia_calliptera | utaka | 0.350 | 0.250 |
deep_benthic | Diplotaxodon | 0.283 | 0.241 |
deep_benthic | mbuna | 0.262 | 0.230 |
deep_benthic | Rhamphochromis | 0.416 | 0.359 |
deep_benthic | shallow_benthic | 0.103 | 0.081 |
deep_benthic | utaka | 0.135 | 0.106 |
Diplotaxodon | mbuna | 0.354 | 0.317 |
Diplotaxodon | Rhamphochromis | 0.477 | 0.402 |
Diplotaxodon | shallow_benthic | 0.300 | 0.261 |
Diplotaxodon | utaka | 0.312 | 0.260 |
mbuna | Rhamphochromis | 0.461 | 0.425 |
mbuna | shallow_benthic | 0.258 | 0.224 |
mbuna | utaka | 0.290 | 0.253 |
Rhamphochromis | shallow_benthic | 0.417 | 0.360 |
Rhamphochromis | utaka | 0.456 | 0.390 |
shallow_benthic | utaka | 0.143 | 0.118 |
Note: a Data were not pruned for LD and the --maf value was set to 0.01 (Run I in Table 2); b The parameters used are listed in Table 2 (Run H). |
Thus, we conclude that the results obtained by our pipeline are consistent with those shown in the original study, which confirms the ability of our pipeline to perform reliable analyses.