PSReliP: an integrated pipeline for analysis and visualization of population structure and relatedness based on genome-wide genetic variant data

doi:10.21203/rs.3.rs-2106876/v1


Research Article



This work is licensed under a CC BY 4.0 License

This is the latest preprint version; the journal version was published 05 Apr, 2023, in BMC Bioinformatics.

Background

Population structure and cryptic relatedness between individuals (samples) are two major factors affecting false positives in genome-wide association studies (GWAS). In addition, population stratification and genetic relatedness in genomic selection in animal and plant breeding can affect prediction accuracy. The methods commonly used for solving these problems are principal component analysis (to adjust for population stratification) and marker-based kinship estimates (to correct for the confounding effects of genetic relatedness). Currently, many tools and software are available that analyze genetic variation among individuals to determine population structure and genetic relationships. However, none of these tools or pipelines perform such analyses in a single workflow and visualize all the various results in a single interactive web application.

Results

We developed PSReliP, a standalone, freely available pipeline for the analysis and visualization of population structure and relatedness between individuals in a user-specified genetic variant dataset. The analysis stage of PSReliP is responsible for executing all steps of data filtering and analysis and contains an ordered sequence of commands from PLINK, a whole-genome association analysis toolset, along with in-house shell scripts and Perl programs that support data pipelining. The visualization stage is provided by Shiny apps, an R-based interactive web application. In this study, we describe the characteristics and features of PSReliP and demonstrate how it can be applied to real genome-wide genetic variant data.

Conclusions

The PSReliP pipeline allows users to quickly analyze genetic variants such as single nucleotide polymorphisms and small insertions or deletions at the genome level to estimate population structure and cryptic relatedness using PLINK software and to visualize the analysis results in interactive tables, plots, and charts using Shiny technology. The analysis and assessment of population stratification and genetic relatedness can aid in choosing an appropriate approach for the statistical analysis of GWAS data and predictions in genomic selection. The various outputs from PLINK can be used for further downstream analysis. The code and manual for PSReliP are available at https://github.com/solelena/PSReliP.

Population stratification

Cryptic relatedness

Population structure analysis

Genetic relatedness investigation

Data visualization

PLINK

Shiny application

Bioinformatics pipeline

Overview of topics of PSReliP

Population structure (or population stratification) (PS) and cryptic relatedness (CR) are two basic aspects of population genetics. PS refers to the presence of systematic differences in allele frequencies between subpopulations that arise from non-random mating. CR occurs when some individuals are closely related but this relatedness is unreported and therefore unknown to the investigators. PS and CR can lead to the problem of confounding in genetic association studies (Astle and Balding [1]). A genome-wide association study (GWAS) is an approach used to evaluate the associations between specific genetic variants and particular phenotypes or diseases.

Principal components analysis (PCA) is the most widely used method to adjust for PS in GWAS (Price et al. [2]). In genetic studies, PCA is generally applied to a genomic relationship matrix (GRM). The method used in PLINK 1.9 [3] and 2.0 [4] to compute the variance-standardized GRM is similar to the method implemented in the GCTA [5] (genome-wide complex trait analysis) tool (Yang et al. [6]). The GRM estimated using GCTA or PLINK can be interpreted as a matrix representation of genetic relationships between individuals in a specified dataset of genetic variants (https://cnsgenomics.com/software/gcta/#Overview).
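To make the variance-standardized GRM more concrete, the following minimal sketch computes a GCTA-style relationship matrix in plain Python. The function name `grm` and the toy genotype matrix are assumptions made for illustration; allele frequencies are estimated from the same samples, missing genotypes are not handled, and the code is not PLINK's or GCTA's actual implementation.

```python
def grm(genotypes):
    """Return a GCTA-style genomic relationship matrix.

    genotypes: list of samples, each a list of alternate-allele
    counts (0, 1 or 2), one entry per variant, no missing calls.
    """
    n_samples = len(genotypes)
    n_variants = len(genotypes[0])
    # Alternate-allele frequency of each variant, estimated from the data.
    freqs = [sum(s[v] for s in genotypes) / (2 * n_samples)
             for v in range(n_variants)]
    # Monomorphic variants have zero variance and cannot be standardized.
    used = [v for v in range(n_variants) if 0.0 < freqs[v] < 1.0]
    if not used:
        raise ValueError("no polymorphic variants")
    matrix = [[0.0] * n_samples for _ in range(n_samples)]
    for j in range(n_samples):
        for k in range(j, n_samples):
            total = 0.0
            for v in used:
                p = freqs[v]
                total += ((genotypes[j][v] - 2 * p)
                          * (genotypes[k][v] - 2 * p)
                          / (2 * p * (1 - p)))
            matrix[j][k] = matrix[k][j] = total / len(used)
    return matrix


# Toy data (hypothetical): samples 0 and 1 are identical, sample 2 is opposite.
toy = [[0, 1, 2], [0, 1, 2], [2, 1, 0]]
relationships = grm(toy)
```

In this toy example the identical pair receives the same value as each sample's diagonal entry, whereas the opposite pair receives a negative value; PCA would then be run on such a matrix.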

Related approaches, such as multidimensional scaling (MDS) performed on identity-by-state (IBS) pairwise distances, can also be applied to control PS in GWAS (Hellwege et al. [7]). Several examples of using MDS for PS analysis have been presented in scientific literature. For example, Linge et al. [8] used MDS to investigate PS using a dataset of 620 individuals from several peach cultivars. In that study, PS was analyzed using MDS and clustering analyses.

Both MDS and cluster analyses can be performed based on IBS pairwise distances (the genome-wide average proportion of alleles sharing IBS between any two individuals) [9]. IBS analysis is a widely used and easily applicable method to measure genetic similarity (similarity of alleles) between pairs of individuals in a population (IBS alleles are not necessarily a consequence of identity by descent (IBD)). This analysis may help understand the degree of genetic diversity in the whole population and different subpopulations.
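The IBS proportion described above can be illustrated with a short sketch. The function names and the encoding of genotypes as alternate-allele counts (with `None` for a missing call) are assumptions for illustration; PLINK's --distance output may differ in details such as the handling of missing data.

```python
def ibs_similarity(g1, g2):
    """Average proportion of alleles shared IBS between two samples.

    g1, g2: alternate-allele counts (0, 1, 2) per variant, None if missing.
    """
    shared = 0
    compared = 0
    for a, b in zip(g1, g2):
        if a is None or b is None:
            continue  # skip variants missing in either sample
        shared += 2 - abs(a - b)  # 2, 1 or 0 alleles shared IBS
        compared += 1
    if compared == 0:
        raise ValueError("no variants genotyped in both samples")
    return shared / (2 * compared)


def ibs_distance(g1, g2):
    """1 minus the IBS similarity, usable as input to MDS or clustering."""
    return 1.0 - ibs_similarity(g1, g2)
```

Two samples with identical genotypes have a distance of 0, and two samples that are opposite homozygotes at every variant have a distance of 1.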

The kinship coefficient, defined as the probability that two homologous alleles, one from each of two individuals, are identical by descent (IBD), is a classic measurement of relatedness (genetic relationships among individuals resulting from shared ancestry) and is important in many fields of biology (Speed and Balding [10]). Genetic relatedness can be calculated from the pedigree (the pedigree-based kinship) or can be estimated using genetic marker data (the marker-based kinship). Pedigree-free (marker-based) methods are preferred for estimating kinship coefficients when there are difficulties in restoring pedigrees in natural populations, or when the results of kinship analysis are used to infer relatedness in GWAS with unavailable or inaccurate pedigree information (Astle and Balding [1]). Several methods have been developed to estimate kinship coefficients from the genotypic data (Astle and Balding [1], Goudet et al. [11]).

Common estimation approaches use allele frequencies for kinship estimation, meaning that an appropriate reference population is required (Goudet et al. [11]). Other kinship estimation methods, such as the KING-robust estimator [12] (Manichaikul et al. [13]), do not use allele frequencies and can provide robust relationship inference in the presence of an unknown population substructure.
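As an illustration of a kinship estimator that does not use allele frequencies, the sketch below follows the between-family form of the KING-robust estimator as it is commonly written, using only counts of heterozygous and opposite-homozygous genotypes. The function name and genotype encoding are assumptions, and the output of PLINK's --make-king may differ in detail.

```python
def king_robust_kinship(g1, g2):
    """KING-robust style kinship estimate for one pair of samples.

    g1, g2: alternate-allele counts (0, 1, 2) per variant, None if missing.
    """
    het_het = opp_hom = het1 = het2 = 0
    for a, b in zip(g1, g2):
        if a is None or b is None:
            continue
        het1 += a == 1                 # heterozygous calls in sample 1
        het2 += b == 1                 # heterozygous calls in sample 2
        het_het += a == 1 and b == 1   # both heterozygous
        opp_hom += abs(a - b) == 2     # opposite homozygotes (IBS0)
    min_het = min(het1, het2)
    if min_het == 0:
        raise ValueError("estimator undefined without heterozygous calls")
    return (0.5
            + (het_het - 2 * opp_hom) / (2 * min_het)
            - (het1 + het2) / (4 * min_het))
```

For two identical genotype vectors the estimate is 0.5 (the value expected for duplicates or monozygotic twins), and many opposite homozygotes drive the estimate below zero.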

Marker-based kinship coefficient matrices (matrices that contain the pairwise kinship coefficients between all individuals) can be used to correct for hidden relatedness as a random effect in a mixed-model approach for GWAS analysis (Kang et al. [14]). The mixed-model approach, which accounts for confounding factors such as fixed effects (for PS) and random effects (kinship matrix), has been widely used in GWAS (Li and Zhu [15], Price et al. [16], Yu et al. [17]), particularly in GWAS conducted in plants and animals. In addition, PS and CR are factors that can influence the prediction of genomic selection (GS) (Windhausen et al. [18], Habier et al. [19], Werner et al. [20]). Wright's F-statistics, including Wright's fixation index (FST), are among the most widely used statistics in population and evolutionary genetics (Holsinger and Weir [21]). F-statistics, particularly FST, are commonly used to measure genetic variation in different populations (PS or the genetic differentiation of populations) (Holsinger and Weir [21], Bhatia et al. [22], Weir and Cockerham [23]).
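The idea behind FST can be shown with the classical heterozygosity-based definition for a single biallelic variant, FST = (HT - HS) / HT. The sketch below is a simplified illustration with equally weighted subpopulations; it is neither the Weir and Cockerham nor the Hudson estimator, and the function name is an assumption.

```python
def wright_fst(subpop_freqs):
    """Classical FST for one biallelic variant.

    subpop_freqs: allele frequency of the same allele in each
    subpopulation (subpopulations are weighted equally).
    """
    n = len(subpop_freqs)
    p_bar = sum(subpop_freqs) / n
    h_total = 2 * p_bar * (1 - p_bar)  # expected heterozygosity, pooled
    if h_total == 0:
        raise ValueError("variant is monomorphic across subpopulations")
    # Mean expected heterozygosity within subpopulations.
    h_within = sum(2 * p * (1 - p) for p in subpop_freqs) / n
    return (h_total - h_within) / h_total
```

Equal frequencies in all subpopulations give FST = 0 (no differentiation), whereas alternative fixed alleles give FST = 1.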

The coefficient of inbreeding (F) of an individual is a measure of inbreeding and can be defined as the probability that two alleles at any given locus in an individual are IBD (Ochoa and Storey [24], Leutenegger et al. [25], Rousset [26]). Estimating the inbreeding coefficients of individuals in GWAS data is important for quality control (QC) when deciding whether to remove individuals with highly positive or highly negative inbreeding coefficients. Highly positive inbreeding coefficients indicate many homozygous genotypes and high levels of inbreeding. The inclusion of these individuals can influence GWAS results because the random mating assumption required for the standard GWAS test is violated. Highly negative inbreeding coefficients, which can be calculated by some estimators, indicate too many heterozygous genotypes and suggest the possibility of contamination.
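The way an estimator can yield both positive and negative inbreeding coefficients is illustrated by the method-of-moments form F = (O - E) / (N - E), where O is the observed and E the expected number of homozygous genotypes among N genotyped variants. The sketch below assumes known allele frequencies and omits any small-sample correction, so it is an approximation rather than PLINK's exact calculation; the function name is hypothetical.

```python
def inbreeding_f(genotypes, alt_freqs):
    """Method-of-moments inbreeding coefficient for one sample.

    genotypes: alternate-allele counts (0, 1, 2) per variant, None if missing.
    alt_freqs: alternate-allele frequency of each variant.
    """
    observed_hom = 0
    expected_hom = 0.0
    n_called = 0
    for g, p in zip(genotypes, alt_freqs):
        if g is None:
            continue
        n_called += 1
        observed_hom += g != 1               # 0 and 2 are homozygous
        expected_hom += 1 - 2 * p * (1 - p)  # Hardy-Weinberg expectation
    if n_called == expected_hom:
        raise ValueError("no informative variants")
    return (observed_hom - expected_hom) / (n_called - expected_hom)
```

A sample that is homozygous at every variant gives F = 1, and a sample that is heterozygous at every variant gives a negative F, which is the pattern that may indicate contamination.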

Integrated approach to data analysis and visualization

In population genetics research and GWAS analysis, several analytical tools and software packages have been developed to investigate stratification and relatedness. PLINK (Purcell et al. [9], Chang et al. [27]) is a popular and commonly used program for analyzing genetic variant data, including the detection of PS and CR. However, PLINK (like many other bioinformatics tools) provides the user with many commands for various analyses, and using them requires a deep understanding of the available parameters, their combinations, supported file formats, etc. To perform an in-depth computational analysis, it is necessary to execute several commands sequentially, with additional steps for data selection and filtering, changing data formats, etc.

In addition, visualization techniques and their applications are often required to interpret the results of the analyses performed. Many tools and packages with different implementations can be used to visualize biological datasets, including genetic and genomic data (Jia et al. [28], Nusrat et al. [29]). One popular web application framework widely used in various research fields is Shiny [30, 31] (https://www.rstudio.com/products/shiny/). Shiny is an open-source R package that offers the ability to develop interactive web applications (apps) with a dynamic user interface (UI) that can be run locally or deployed over the Internet. Shiny can be used in combination with Plotly's R graphing library [32] (https://plotly.com/r/) to create interactive web-based graphical representations of data, such as plots, charts, histograms, heatmaps, etc.

Integration of analysis and visualization functionalities into the same application or pipeline is an important approach used in various biomedical research areas, including genetics and genomics. There are some examples of pipelines that combine a comprehensive analysis of sequencing data and visualization capabilities. For example, Wang et al. [33] created the “CRISPR-DAV: CRISPR NGS data analysis and visualization pipeline,” which analyzes the CRISPR (clustered regularly interspaced short palindromic repeat) NGS (next generation sequencing) data and visualizes the analysis results. The pipeline itself is implemented in Perl and R and uses a set of common bioinformatics tools. Buza et al. [34] developed the “iMAP: an integrated bioinformatics and visualization pipeline for microbiome data analysis,” which performs the analysis of marker-based microbiome data using several publicly available tools and generates graphics and progress reports using various R packages and R-markdown. There are also several applications for PS and genetic relatedness analyses and the visualization of their results. However, as discussed later, these applications differ from the pipeline we have developed in terms of the types of analysis performed, functions offered to users, and their implementation.

In this study, we developed a PS and relatedness integrated pipeline, PSReliP, which analyzes and visualizes the PS and relatedness between individuals (samples) based on genome-wide genetic variant data. All analyses are performed at high speed using PLINK software in a sequential manner with programs and scripts written in-house. The Shiny web application allows users to interactively visualize the analysis results in a web browser. Herein, we describe the structure of PSReliP, explain the functionality of its analysis and visualization stages and UI, and demonstrate its application to genome-wide genetic variant data of rice varieties and Malawi cichlids.

Pipeline Structure

The PSReliP pipeline combines analysis techniques with an interactive visualization of the analysis results. Figure 1 shows a conceptual overview of the pipeline structure with distinct steps and associated output files.

The first step in the pipeline is the conversion of the variant call format (VCF) or binary variant call format (BCF) files into PLINK format files, which are later used in the data analysis process. The main steps of the analysis stage of the pipeline are: 1) QC and filtering of samples and variants; 2) calculation of basic sample statistics, such as the types of observed variants, inbreeding coefficients, etc., performed before and after data filtering; 3) analysis of PS using PCA and MDS, and complete-linkage hierarchical clustering of samples based on the IBS distance matrix, if selected; 4) calculation of Wright's FST; 5) calculation of the IBS distance matrix and analysis of genetic relatedness by estimating the KING kinship coefficient matrix and GRM. All the steps are performed sequentially. Once the analysis commands are completed, their results are combined with the Shiny app.R file and placed in a single directory to create the Shiny application, which produces interactive tables, plots, and charts of data and displays them through a web browser. In addition, the Shiny application allows the user to download the PLINK result files for evaluation and further use in other tools and software. All steps of the data analysis and visualization of our pipeline are elaborated in the following subsections.

Implementation of each stage

The proposed integrated pipeline can be divided into two stages: 1) the analysis stage, which includes a pre-analysis step and 2) the visualization stage. These stages differ in their implementation. Figure 2 outlines the implementation of PSReliP and shows the major parts of the pipeline implemented in Shell, Perl, and R using the PLINK software and several publicly available R packages.

Analysis stage

The analysis stage, which includes the pre-analysis step, is performed by two bash shell scripts that contain PLINK command lines, bash, and Unix commands and invoke in-house Perl programs. These bash shell scripts are executed from the command line on UNIX or Linux operating systems and take several arguments from the configuration file. The configuration file is located in the PSReliP installation directory and contains information about the paths to the PLINK executables (1.9 and 2.0), pipeline installation directory, working directory, input files, and parameter values used in the analysis and visualization processes (see Supplementary Table 1 for details). Users must edit the configuration file before executing the bash shell scripts. The details of the setting parameters are described in the configuration file.

PLINK (1.9 and 2.0) is the main software used in all the analysis steps in PSReliP. We used PLINK 2.0 wherever possible; however, certain commands, such as --ibc, --cluster, --mds-plot, and --distance, have not yet been implemented in PLINK 2.0, and for these we used version 1.9 of the PLINK software. If these commands are implemented in PLINK 2.0, we will switch the corresponding steps of the analysis to that version. In the pre-analysis step of PSReliP, the VCF or BCF files are converted into PLINK format files. This step is performed by running the first shell script, which takes VCF (possibly gzipped) or BCF files as inputs; BCF files can be either uncompressed or BGZF-compressed (supported by htslib). The main outputs of this step are PLINK 2 binary files in the following formats: PGEN, binary genotype file format; PSAM, format in which sample information is stored; and PVAR, format in which variant information is stored. The newly created PLINK 2 binary files are used as inputs for the following analysis steps. In addition, when this first shell script is run, an allele count report is created and written to a PLINK .acount file (produced by --freq with the ‘counts’ modifier). This file is loaded with --read-freq in the analysis stage during the GRM calculation (--make-rel) and the PCA run (--pca). Only one filter, ‘--max-alleles 2’, is applied in this pre-analysis processing step. It is sufficient to run the first shell script only once for a given set of genetic variants and one specified working directory to prepare the input files for the following analysis. When changing the working directory, it is necessary to start the analysis stage from the beginning and run the first shell script again.
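The pre-analysis step described above can be pictured as a single PLINK 2 invocation. The sketch below only assembles the argument list from the flags named in the text (--vcf or --bcf, --max-alleles, --freq counts, --make-pgen, --threads, --memory, --out) and does not run PLINK. The function name, file names, and the choice to combine everything in one command are assumptions; the actual command lines of PSReliP are given in its Supplementary Note 1.

```python
def build_import_command(plink2, variant_file, out_prefix,
                         threads=8, memory_mb=8000):
    """Assemble a PLINK 2 command that converts VCF/BCF to PGEN/PVAR/PSAM."""
    # Assumption: the input type is chosen from the file extension.
    input_flag = "--bcf" if variant_file.endswith(".bcf") else "--vcf"
    return [
        plink2,
        input_flag, variant_file,
        "--max-alleles", "2",        # the only filter of the pre-analysis step
        "--freq", "counts",          # writes the .acount allele count report
        "--make-pgen",               # writes .pgen, .pvar and .psam
        "--threads", str(threads),
        "--memory", str(memory_mb),
        "--out", out_prefix,
    ]


# Hypothetical paths, for illustration only.
command = build_import_command("plink2", "rice.vcf.gz", "work/rice")
```

Such a list could be passed to `subprocess.run(command, check=True)` on a system where PLINK 2 is installed.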

The analysis stage is performed by running the second shell script, which executes all the analysis steps carried out by this pipeline. As mentioned above, during the analysis stage, the following processes are performed: 1) QC and filtering of samples and variants; 2) calculation of basic sample statistics; 3) analysis of PS using PCA, MDS, and clustering; 4) calculation of Wright's FST; and 5) calculation of the IBS, GRM, and KING kinship coefficient matrices. All analyses were carried out using PLINK 1.9 and 2.0 software. While running the second shell script, the PLINK, bash, and Unix commands are executed sequentially, and many of these commands take input from the previous command and produce an output for the next command. Users can alter multiple parameters used in the analysis steps by appropriately changing their values in the configuration file before running the shell script. Users can run the second shell script multiple times on the given genetic variant dataset using different parameter values and perform the analysis that best matches their data. Additionally, in-house Perl programs are called from both shell scripts to support pipelining by selecting appropriate data, formatting data, etc. The implementation of PSReliP, particularly the PLINK command lines with flags and parameters, is detailed in Supplementary Note 1. In this study, we used the results of several runs of our pipeline applied to two datasets, which are described in the following sections. The time required to complete each of these runs is shown in Tables 1 (runs of the first shell script) and 2 (runs of the second shell script).
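To illustrate how parameter values from a configuration file could be turned into the QC filters of the first analysis step, the following sketch builds the corresponding PLINK flags (--snps-only, --geno, --mind, --maf). The function name and the example values for --geno and --maf are assumptions; only the --mind value of 0.2 is taken from the text, and the real pipeline assembles its commands in bash.

```python
def build_filter_args(geno=None, mind=None, maf=None, snps_only=False):
    """Translate QC settings into a list of PLINK filter flags."""
    args = []
    if snps_only:
        args.append("--snps-only")  # restrict the analysis to SNPs
    for flag, value in (("--geno", geno), ("--mind", mind), ("--maf", maf)):
        if value is None:
            continue  # filter not requested
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{flag} expects a value between 0 and 1")
        args += [flag, str(value)]
    return args


# Example settings (geno and maf values are hypothetical).
example_args = build_filter_args(geno=0.1, mind=0.2, maf=0.05, snps_only=True)
```

Changing a value in the configuration and rebuilding the argument list corresponds to rerunning the second shell script with different filtering options.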

Table 1

Parameter values, the number of samples and variants, and the required time (first shell script)
Dataset	Computing time (8 threads; 8000 MB RAM)	Computing time (32 threads; 32000 MB RAM)	Max alleles	Samples	Loaded variants	Filtered variants
Rice dataset	573 s	358 s	2	143	35,568,995	30,904,333
Cichlids dataset	778 s	551 s	2	120	57,365,062	49,965,032
Note: Computing time represents the time it took to complete runs on 8 threads with 8000 MB of RAM and on 32 threads with 32000 MB of RAM; Max alleles represents the PLINK --max-alleles flag, which filters out variants with more than a given number of alleles; Number of samples and variants represents the number of samples and variants calculated at the analysis stage. In the PSReliP pipeline, the --threads and --memory flags are used in the PLINK command lines, and the values of these parameters can be specified in the configuration file (see Supplementary Table 1 for details).

Table 2 Parameter values, the number of samples and variants, and the required time (second shell script)

Note

Computing time represents the time it took to complete runs on 8 threads and 8000 MB RAM of memory and 32 threads and 32000 MB RAM of memory; Setting parameters represents the parameters specified in the configuration file; Types of variants represents the type of variants, such as “SNPs” or “SNPs and InDels,” included in the analysis (“SNPs” indicates that the PLINK --snps-only flag was used); Geno represents the PLINK --geno flag, which filters out all variants with missing call rates exceeding the provided value; Mind represents the PLINK --mind flag, which filters out all samples with missing call rates exceeding the provided value; Maf represents the PLINK --maf flag, which filters out all variants with allele frequency below the provided threshold; Meanimpute represents the usage of the PLINK 'meanimpute' modifier to request the mean imputation of missing genotype calls (in --pca and --make-rel commands); Clustering represents the usage of the PLINK --cluster command to perform complete-linkage hierarchical clustering; Number of groups represents the number of groups/clusters provided by users or calculated by PLINK; LD-based pruning represents the usage of the PLINK --indep-pairwise command to produce a pruned subset of variants that are in approximate linkage equilibrium with each other (it takes three parameters: window size in variant count (vc) or kilobase (kb), variant count to shift the window (step size), which is required to be 1 when a kilobase window is used, and r2 threshold); Number of samples and variants represents the number of samples and variants calculated at the analysis stage; Filtered variants represents the number of variants remaining after filtering; and Filtered and pruned variants represents the number of variants remaining after filtering and LD-based pruning (if it was used).
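The three parameters of --indep-pairwise described in the note, including the rule that the step size must be 1 when a kilobase window is used, can be checked before PLINK is called. The sketch below is a hypothetical helper for illustration, and the example values are assumptions rather than the settings used in the runs reported here.

```python
def indep_pairwise_args(window, step, r2, unit="vc"):
    """Build the --indep-pairwise arguments for LD-based pruning.

    window: window size in variant count ("vc") or kilobases ("kb").
    step:   variant count to shift the window; must be 1 for a kb window.
    r2:     pairwise r^2 threshold.
    """
    if unit not in ("vc", "kb"):
        raise ValueError("unit must be 'vc' or 'kb'")
    if unit == "kb" and step != 1:
        raise ValueError("step size must be 1 when a kilobase window is used")
    if not 0.0 < r2 <= 1.0:
        raise ValueError("r2 threshold must be in (0, 1]")
    # PLINK marks a kilobase window with a 'kb' suffix on the window size.
    window_token = f"{window}kb" if unit == "kb" else str(window)
    return ["--indep-pairwise", window_token, str(step), str(r2)]
```

For example, `indep_pairwise_args(50, 10, 0.2)` describes a 50-variant window shifted by 10 variants, and `indep_pairwise_args(100, 1, 0.5, unit="kb")` describes a 100 kb window.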

Visualization stage

To visualize the results of the analysis, we created a web-based visualization stage for PSReliP. We implemented this stage using Shiny technology (https://shiny.rstudio.com/), which provides a dynamic and interactive UI, and developed the Shiny application, an interactive R-based web application. We used the Shiny package in combination with Plotly's R graphing library (https://plotly.com/r/), which allows the creation of interactive graphs and provides basic interactivity, such as zooming in and out, panning graphs, point value display, etc. Plotly is also capable of creating a figure that includes different types of subplots. Using the Plotly R library for basic charts, we created grouped and stacked bar charts and line plots as well as a combination of these for basic sample statistics, including GCTA inbreeding coefficient report and scatter plot for the results of PS analysis (PCA plot). In the scatter plot for PCA (bubble chart), marker sizes are variable and marker colors are mapped to a categorical variable. Using Plotly in conjunction with the ‘manhattanly’ R package (https://cran.r-project.org/web/packages/manhattanly/), Manhattan plots for Wright's FST analysis results were created. In Manhattan plots, the genetic variants are plotted with per-variant FST values against their genomic positions. Manhattan plots implemented with the ‘manhattanly’ package have the advantage of adding extra annotation information to each point in these plots. Heatmaps of IBS distances, genetic relationships, and kinship coefficients across all individuals (samples) were created using Plotly in conjunction with the ‘heatmaply’ R package (https://cran.r-project.org/web/packages/heatmaply/). Interactive heatmaps can zoom into a region of interest and allow the checking of values by hovering the mouse over a cell.
To visualize the basic statistics of the samples, in addition to charts, tables are created with the ‘DT’ (DataTables) R package (https://cran.r-project.org/web/packages/DT/), which allows users to display their data as tables in the HTML pages and provides filtering, sorting, searching, and other features in the tables. The HTML pages can be saved as standalone HTML files with the necessary JavaScript and CSS embeddings. The required R packages are listed in the Availability and Requirements section of this paper and in the README file in the PSReliP installation directory.

At the end of the PSReliP analysis stage, the second shell script creates a directory with the name specified by the user in the configuration file and copies the Shiny application (app.R) into the directory. The results of the analysis and the file containing the arguments for the Shiny app are copied to the ‘data’ subdirectory. Users can run the created Shiny application locally in RStudio or deploy it in one of two main ways: on their own Shiny Server or in the cloud on shinyapps.io (https://shiny.rstudio.com/tutorial/written-tutorial/lesson7/). Running the Shiny application creates interactive data tables, plots, and charts and displays them in a web browser that supports Shiny, such as Google Chrome, Mozilla Firefox, Safari, Microsoft Edge, and Internet Explorer.
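The directory layout produced at the end of the analysis stage (app.R at the top level and the results plus the argument file in a ‘data’ subdirectory) can be sketched as follows. The pipeline itself does this in a bash script; the Python function and all file names other than app.R and data are hypothetical and used only to show the layout.

```python
import shutil
import tempfile
from pathlib import Path


def assemble_shiny_app(app_source, result_files, args_file, app_dir):
    """Copy app.R and the analysis outputs into a Shiny app directory."""
    app_dir = Path(app_dir)
    data_dir = app_dir / "data"
    data_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(app_source, app_dir / "app.R")
    # Results and the argument file for the app go into 'data'.
    for source in [*result_files, args_file]:
        shutil.copy2(source, data_dir / Path(source).name)
    return app_dir


# Demonstration with placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "app.R").write_text("# Shiny app placeholder\n")
    (tmp / "run.eigenvec").write_text("placeholder\n")
    (tmp / "shiny_args.txt").write_text("placeholder\n")
    app_dir = assemble_shiny_app(tmp / "app.R", [tmp / "run.eigenvec"],
                                 tmp / "shiny_args.txt", tmp / "my_app")
    created = sorted(p.relative_to(app_dir).as_posix()
                     for p in app_dir.rglob("*"))
```

The resulting directory is what would be opened in RStudio or copied to a Shiny Server.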

User Interface

To describe the functionality and application of our pipeline, along with the UI of its visualization stage, we used screenshots of the UI that appeared in the Google Chrome browser when PSReliP was run on a Shiny Server installed on our CentOS Linux. The data used in these screenshots were derived from the genome-wide genetic variant data of 143 worldwide rice samples registered in the Rice Annotation Project Database [35] (RAP-DB; https://rapdb.dna.affrc.go.jp) (details are described in the Results section).

As described above, the pipeline visualization stage generates tables, plots, charts, and heatmaps to show the results of the analysis stage. Visualizing the analysis results in a user-friendly manner is important for interpretation and optimization of the analysis process. The main functionalities of the visualization stage are shown in Figs. 3 and 4. The parameters used in the analyses are listed in Table 2 (Run A).

Figure 3 shows a screenshot of the web page shown when the Shiny app was accessed for the first time, with one exception: in this example, the [sample-based missing data reports ⑥ʹ] value was selected after the page was loaded. The parameters used in the analysis stage, which were specified in the configuration file, are shown at the top of Fig. 3 (indicated by ①). The number of samples and variants loaded as well as the number of remaining samples and variants after filtering and linkage disequilibrium (LD) pruning, which were calculated at the analysis stage, are shown at the top of Fig. 3, just below the selected parameters, and are indicated by ②. Figure 3 ③ shows the download button for PLINK 1.9 .bim file, which is an extended variant information file containing information about all variants used in the analysis. Figure 3 ④ and Fig. 4 ① show the main menu of the PSReliP UI. The four tabs on this menu correspond to the types of analysis performed in our pipeline. The tabs are as follows: 1) ‘Basic statistics’; 2) ‘Population Stratification analysis’; 3) ‘Wright's FST estimation’; and 4) ‘IBS and GRM calculation & Kinship Coefficients estimation’. In both figures (Figs. 3 ④ʹ and 4 ①ʹ), the ‘Basic statistics’ tab is selected, and the basic sample statistics are displayed. Figures 3 ⑤ and 4 ② show the radio button labeled ‘Datasets’ with two values: “original” and “after filtering.” These values correspond to the datasets displayed in this tab:

Original: the original dataset with no applied filters, which contains all the samples and variants included in the input VCF/BCF files. An example of this is shown in Fig. 3 ⑤ʹ.

After filtering: all the filters specified in the configuration file were applied to the dataset. An example of this is shown in Fig. 4 ②ʹ.

As can be seen from the two screenshots, the number of samples in the original dataset (143 in Fig. 3) decreased to 141 (Fig. 4) after filtering by the maximum per-sample missing genotype rate (--mind with a value of 0.2). The number of variants also decreased after filtering by the maximum per-variant missing genotype rate and minor allele frequency and after LD-based pruning.
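The effect of the per-sample missingness filter can be reproduced on toy data. The sketch below removes samples whose missing call rate exceeds the threshold, mirroring the behavior described for --mind; the function name, sample IDs, and genotypes are made up for illustration.

```python
def filter_samples_by_missingness(genotypes_by_sample, max_missing=0.2):
    """Keep samples whose missing call rate does not exceed max_missing.

    genotypes_by_sample: dict mapping sample ID to a list of genotype
    calls, with None marking a missing call.
    """
    kept = {}
    for sample_id, calls in genotypes_by_sample.items():
        missing_rate = sum(call is None for call in calls) / len(calls)
        if missing_rate <= max_missing:
            kept[sample_id] = calls
    return kept


# Toy data: S2 has 40% missing calls and is removed; S3 has exactly 20%.
toy_samples = {
    "S1": [0, 1, 2, 0, 1],
    "S2": [None, None, 2, 0, 1],
    "S3": [None, 1, 2, 0, 1],
}
remaining = filter_samples_by_missingness(toy_samples, max_missing=0.2)
```

A sample at exactly the threshold is retained because only rates exceeding the provided value are filtered out.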

Our pipeline performs two types of basic statistics analysis for both datasets: sample variant counts and sample-based missing data counts (Fig. 3 ⑥ and Fig. 4 ③), calculated using the “--sample-counts” and “--missing sample-only” options, respectively. As shown in Fig. 3, ‘Sample-based missing data reports’ was selected (Fig. 3 ⑥ʹ), and the corresponding report calculated for the original dataset was displayed.

When the LD-based pruning flag is set in the configuration file, two additional analyses are performed: 1) observed and expected homozygous/heterozygous genotype counts for each sample, calculated using the “--het ‘cols = + het,+het’” option, and 2) three inbreeding coefficients for each sample, calculated using the “--ibc” option. As shown in Fig. 4, ‘GCTA inbreeding coefficient report’ was selected (Fig. 4 ③ʹ), and the corresponding report calculated for the filtered and LD-pruned datasets was displayed.

The ‘Basic statistics’ tab offers two types of data representation: charts and tables (Fig. 3 ⑦ and Fig. 4 ④). The ‘Table’ representation (Fig. 3 ⑦ʹ) of the missing data is shown in Fig. 3, and the ‘Chart’ representation (Fig. 4 ④ʹ) (multiple subplots) of the missing data (bar chart) and three inbreeding coefficients (scatter plots with lines) for each sample is shown in Fig. 4.

The original PLINK result files can be downloaded as ZIP files so that users can evaluate the analysis results and further use them in other tools and software (Fig. 3 ⑧). The downloaded ZIP file contains four files: .scount (sample variant-count report), .smiss (sample-based missing data report), .het (method-of-moments F coefficient estimates), and .ibc (GCTA inbreeding coefficient report). The button labeled ‘Save the chart as a standalone HTML file’ shown in Fig. 4 ⑤ allows the user to download the displayed chart as a single standalone HTML file. In addition, using the features provided by Plotly, the user can export the displayed image from the browser as an image file in the format specified in the configuration file, such as PNG, JPEG, WebP, SVG, and PDF. Other UI features of our pipeline are illustrated in the Results section.

Preparing data for case studies

To demonstrate the application of the proposed pipeline, validate its efficiency in assessing PS and CR, and illustrate its functionalities and capabilities, two case studies were conducted on rice varieties and Malawi cichlids. To prepare the data for these case studies, we first downloaded the sequencing data with associated metadata from the NCBI and EBI databases and then performed sequence alignment and variant calling using the procedure described in Supplementary Note 2. To create the dataset of rice varieties, we selected BioSample accessions of cultivars, landraces, and wild species registered in the Rice Annotation Project Database [35] (RAP-DB, http://rapdb.dna.affrc.go.jp/) (Sakai et al. [36]) with an average depth of sequencing coverage greater than 30. To create the dataset of Malawi cichlids, we selected BioSample accessions from BioProject PRJEB1254 and PRJEB15289 for which the sampling locations were recorded in the NCBI BioSample database. All raw sequencing reads were obtained from a previous study (Malinsky et al. [37]). We compared the data obtained by running our pipeline with the data published in that article, as discussed in the following subsections. Details of selecting BioSample accessions and reference genomes [38, 39], downloading nucleotide sequence data, and preparing genetic variant data are described in Supplementary Note 3. Accessions from the BioProject, BioSample, and European Nucleotide Archive databases are listed in Supplementary Tables 2 and 3.

Results obtained in case studies

Analysis of genetic variants of rice varieties

The results of the analyses performed five times using different filtering and pruning options (Table 2) are described here.

The four tabs on the main menu, corresponding to the types of analysis performed in our pipeline, are shown at the top of Fig. 5 (indicated by ①), of which the ‘Population Stratification analysis’ tab was selected (Fig. 5 ①ʹ). The parameter values used in the PS analysis are listed in Table 2 (Run A).

For this analysis, we prepared three methods: PCA, normalized PCs (each eigenvector is multiplied by the square root of its eigenvalue), and MDS (Fig. 5 ②). In the example shown in this figure, PCA is selected (Fig. 5 ②ʹ). In PLINK 2.0, by default, the top 10 PCs are extracted from the variance-standardized relationship matrix, and all of these components are used in the visualization stage of our pipeline. The scatter plot in Fig. 5 is an interactive 2-component PCA plot in which the first principal component (PC1) is represented by the horizontal axis and explains 10.4% of the variance, and the second principal component (PC2) is represented by the vertical axis and explains 5.2% of the variance. See Supplementary Note 4 for details on calculating the percentage of variance explained by each PC. The 2-component PCA plot for the other PCs can be drawn by selecting the corresponding components from the two drop-down lists, as shown in Fig. 5 ③. Users can highlight one of the samples on the PCA plot by selecting it from the drop-down list (Fig. 5 ④), and the selected sample will be shown to be larger than the others (Supplementary Fig. S1a). The interactive scatter plot can display annotation information, such as sample ID, PC values, and the group or cluster number to which the sample belongs, by hovering the mouse pointer over the markers (samples) in the scatter plot (Fig. S1a, Run A in Table 2). Users can “Hide” or “Display” the IDs or names of the samples by changing the checked value in the radio buttons (Fig. 5 ⑤, Fig. S1b). In this figure, the ‘Hide’ value was checked (Fig. 5 ⑤ʹ), and accordingly the names of the samples were hidden. Additionally, users can zoom in and out of the plot using Plotly's zoom functionality (Fig. S1b, Run A in Table 2).
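
The normalized-PC option described above (each eigenvector multiplied by the square root of its eigenvalue) can be illustrated with a minimal Python sketch. It assumes that the eigenvectors and eigenvalues have already been read from the PLINK .eigenvec and .eigenval files into plain lists; the function name `normalize_pcs` and the toy values are hypothetical and are not taken from PSReliP itself.

```python
import math


def normalize_pcs(eigenvectors, eigenvalues):
    """Scale each PC column by the square root of its eigenvalue.

    eigenvectors: one row per sample, one column per PC.
    eigenvalues:  one value per PC, in the same column order.
    """
    scales = [math.sqrt(value) for value in eigenvalues]
    return [
        [coord * scale for coord, scale in zip(row, scales)]
        for row in eigenvectors
    ]


# Toy input (hypothetical values): two samples, two PCs.
pcs = [[0.5, -0.2], [-0.5, 0.1]]
vals = [4.0, 1.0]
normalized = normalize_pcs(pcs, vals)
```

With these toy values, the PC1 coordinates are doubled (square root of 4) and the PC2 coordinates are unchanged (square root of 1), so components with larger eigenvalues spread the samples further apart in the plot.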

In our pipeline, a PCA plot is a scatter plot that maps marker colors to a categorical variable (user-defined groups or clusters calculated using PLINK). Figure 5 ⑥ shows the color legend that matches the groups defined by us with the corresponding marker colors. We grouped 143 rice varieties into 16 groups based on the rice types (e.g., indica, aus, temperate japonica, tropical japonica, and aromatic) (Supplementary Table 4), similar to the way they were grouped in the RAP-DB. As mentioned earlier, after filtering by the maximum missing genotype rate per sample (--mind with a value of 0.2), the number of samples decreased to 141 and the number of groups decreased to 14 (“MER: Oryza meridionalis” and “PUN: Oryza punctata” were excluded) (Fig. 5). The japonica varieties (JP: Oryza sativa Japonica Group, TEJ: Oryza sativa temperate japonica subgroup, TRJ: Oryza sativa tropical japonica subgroup) and indica varieties (IND: Oryza sativa Indica Group, AUS: Oryza sativa aus subgroup) were separated from each other along PC1, whereas the TRJ and TEJ groups as well as the TRJ group and indica varieties were separated along PC2 (Fig. 5). The same can be observed in the MDS plot (Supplementary Figs. S2 and S3, Run A in Table 2). Plotly provides functions to show/hide data from each group individually by clicking on the corresponding legend items. We used this feature on all charts and plots in our pipeline.

For a more complete overview of the results of the PCA and MDS analyses, along with the projections of the samples on the plane defined by the first two PCs, plots of PC1 and PC3 are often used. Supplementary Fig. S4 shows plots of PC1 and PC3 obtained from the same analysis as that illustrated in Fig. 5 (Run A in Table 2). PC3 explains 3.2% of the variance. Samples from groups such as BAR (Oryza barthii), GLA (Oryza glaberrima), and GLU (Oryza glumaepatula) as well as some samples from the RUF (Oryza rufipogon) group, which are indicated in the figure by an oval, were separated along PC3 from the other groups, including the Oryza sativa groups mentioned above (Fig. S4).

To analyze PS and CR in japonica and indica varieties used in our case study, we selected samples belonging only to five groups (JP, TEJ, TRJ, IND, and AUS) and performed the analysis stage. The resulting PCA plots are shown in Fig. 6a (Run C in Table 2) and Fig. 6b (Run E in Table 2).

Figure 6a shows a pattern similar to that in Fig. 5. The TEJ group (Fig. 6a ①) and indica varieties, which included the IND and AUS groups (Fig. 6a ③), were separated from each other along PC1, while the TRJ group (Fig. 6a ②) was separated from the others along PC2. Figure 6b shows the PCA plot for the same data (Run E in Table 2); however, the marker colors in Fig. 6b indicate the clusters calculated using PLINK. Whether the clustering function or user-specified groups are used can be set as a parameter in the configuration file. In addition, users can specify a number of clusters that is biologically interesting or easily interpretable. As shown in Fig. 6b, the TEJ and IND groups are divided into two separate clusters along the first and second PCs, respectively, which coincides with the PCA results.

As with other types of analysis, on the ‘Population Stratification analysis’ tab, we prepared a button to download analysis result files (Fig. 5 ⑦), which were either PLINK output files such as .eigenvec (PCs), .eigenval (eigenvalues), and .mds or generated by our pipeline as a normalized_plink_pca.txt file, which contained normalized PCs calculated using PLINK output files.

As described above, during the analysis stage of our pipeline, the PLINK --fst command is executed, and the results are visualized in the ‘Wright's FST estimates’ tab (Fig. 7 ①ʹ, Supplementary Fig. S5) of the tabs panel (Fig. 7 ①). The parameter values used in FST analysis are listed in Table 2 (Run C).

Users can select one of the pairs of subpopulations (groups or clusters of samples) by choosing the pair from the drop-down list (Fig. 7 ② and Fig. S5 ①), and Wright's FST value between the pairs of selected subpopulations (pairwise FST) is displayed in the text box immediately below this drop-down list. In this example, this value was 0.235 for the IND and TEJ groups.

The PLINK --fst command with the 'report-variants' modifier calculates per-variant FST estimates; our pipeline uses this modifier only if the number of groups/clusters is ≤ 5 (to control the output size). The FST values for each variant between pairs of the selected subpopulations are shown in the Manhattan plot (Fig. 7 and Fig. S5). Variants with 'nan' FST values were removed from the FST plot, and negative FST values were set to zero. The plot does not load in the browser if a large number of genetic variants are included in the analysis (after filtering and pruning). Therefore, we plot chromosomes/contigs one at a time or the entire genome region only if the number of variants is ≥ 100 and ≤ 100,000. Users can switch these views by changing the corresponding values from the drop-down list (Figs. 7 ③ and S5 ②). According to the selected values of this list, Fig. 7 shows the distribution of FST on the Manhattan plot for all chromosomes, while Fig. S5 shows the Manhattan plot for only chromosome 9. To reduce the loading time of the Manhattan plot containing a large amount of data, we added a drop-down list with a range of FST values (0–0.9) (Figs. 7 ④ and S5 ③). Accordingly, only those variants with FST values ≥ 0.1 were displayed in the plot shown in Fig. S5. For a particular variant, an FST value near 1 indicates that each of the two populations is fixed for a different allele at that locus, as for the variant shown in Fig. S5 ④. The legend colors in the plot (Fig. 7 ⑤) correspond to chromosome numbers. The download button labeled ‘Save original data for a selected pair of subpopulations as a zip file’ (Fig. 7 ⑥) allows the user to download the files obtained with the PLINK --fst command that contain FST estimates between the two selected subpopulations. This can be a single file, that is, .fst.summary (all-population-pairs Wright's FST report), or two files, .fst.summary and .fst.var (per-variant Wright's FST report for one population pair), depending on the number of groups/clusters (per-variant FST estimates are calculated if the number of groups/clusters is ≤ 5). The user can find the original files in the ‘data’ subdirectory of the Shiny app directory.
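
The clean-up applied to per-variant FST values before plotting can be sketched as follows. This is a minimal Python illustration, not the pipeline's R code: the function names, the tuple layout, and the toy records are hypothetical, and `can_plot_whole_genome` encodes only one reading of the variant-count range stated above.

```python
import math


def prepare_fst_for_plot(records, min_fst=0.0):
    """Return (chrom, pos, fst) tuples ready for a Manhattan plot.

    records: iterable of (chrom, pos, fst), where fst is a float and
             undefined estimates are float('nan').
    """
    cleaned = []
    for chrom, pos, fst in records:
        if math.isnan(fst):
            continue  # variants with 'nan' FST values are removed
        fst = max(fst, 0.0)  # negative estimates are set to zero
        if fst >= min_fst:  # display threshold chosen in the drop-down list
            cleaned.append((chrom, pos, fst))
    return cleaned


def can_plot_whole_genome(n_variants, lower=100, upper=100_000):
    """Check the variant count against the range stated in the text."""
    return lower <= n_variants <= upper


# Toy records (hypothetical): one undefined, one negative, two positive values.
rows = [
    ("9", 100, float("nan")),
    ("9", 200, -0.05),
    ("9", 300, 0.95),
    ("9", 400, 0.05),
]
plotted = prepare_fst_for_plot(rows, min_fst=0.1)
```

With a threshold of 0.1, only the variant at position 300 remains, which mirrors how raising the threshold reduces the number of points the browser has to render.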

To illustrate how the FST values depend on the variants used in the analysis, we ran our pipeline on the same input subset (Run C in Table 2) without LD-based pruning and with the --maf parameter of 0.01 (Run D in Table 2). For comparison, the HUDSON_FST values (between-population FST estimates) obtained from the two runs are listed in Table 3. The FST values are substantially lower for the LD-pruned data, which is consistent with results reported in the literature (see the Discussion section).

Table 3

Pairwise Hudson's FST between groups in the dataset of genetic variants of rice varieties
POP1	POP2	HUDSON_FST (not LD pruned)^a	HUDSON_FST (LD-pruned variant set)^b
AUS	IND	0.34	0.094
AUS	JP	0.569	0.202
AUS	TEJ	0.72	0.29
AUS	TRJ	0.615	0.202
IND	JP	0.46	0.161
IND	TEJ	0.605	0.235
IND	TRJ	0.486	0.15
JP	TEJ	0.124	0.053
JP	TRJ	0.259	0.149
TEJ	TRJ	0.412	0.203
Note: ^a Data were not pruned for LD and the --maf value was set to 0.01 (Run D in Table 2); ^b The parameters used are listed in Table 2 (Run C).
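
For readers unfamiliar with the estimator behind the HUDSON_FST column, the following Python sketch implements the ratio-of-averages form of Hudson's FST as formulated by Bhatia et al. [22]. It is an illustration only and may differ in detail from the PLINK implementation. The function name and toy frequencies are hypothetical, and n1 and n2 are assumed here to be the numbers of sampled chromosomes in each population, held constant across variants.

```python
def hudson_fst(freq_pairs, n1, n2):
    """Ratio-of-averages Hudson FST for one population pair.

    freq_pairs: list of (p1, p2) sample allele frequencies, one pair per variant.
    n1, n2:     number of sampled chromosomes in each population (assumption:
                constant across variants, i.e. no missing genotypes).
    """
    numerator = 0.0
    denominator = 0.0
    for p1, p2 in freq_pairs:
        # Squared frequency difference, corrected for finite sample size.
        numerator += (
            (p1 - p2) ** 2
            - p1 * (1 - p1) / (n1 - 1)
            - p2 * (1 - p2) / (n2 - 1)
        )
        denominator += p1 * (1 - p2) + p2 * (1 - p1)
    return numerator / denominator


# Toy cases (hypothetical frequencies).
fixed_difference = hudson_fst([(1.0, 0.0)], n1=20, n2=20)
identical_freqs = hudson_fst([(0.5, 0.5)], n1=11, n2=11)
```

The first toy case (populations fixed for different alleles) gives an FST of 1, and the second (identical frequencies) gives a slightly negative value because of the sample-size correction, which is why negative per-variant estimates can occur and are set to zero before plotting.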

Fig. 8 shows an example of the genetic similarity between individuals (samples) and the genetic relatedness between them (Run A in Table 2).

Users can display the results of these analyses by selecting the tab ‘IBS and GRM calculation & Kinship Coefficients estimation’ (Fig. 8 ①ʹ) on the main menu (Fig. 8 ①). For this type of analysis, we prepared three methods (Fig. 8 ②): IBS matrix calculation (Fig. 8 ②ʹ), GRM, and KING-robust kinship estimation. The results of these three types of calculations are displayed on interactive heatmaps, where samples can be ordered in two ways, ‘PLINK Sample ID’ and ‘Group/Cluster number’ (Fig. 8 ③). The list of sample IDs/Names on the heatmap can be in the same order as in the matrix derived from the corresponding PLINK command, or samples can be reordered according to the groups/clusters to which they are assigned (Fig. 8 ③ʹ). The gradient color bar in the heatmap (Fig. 8 ④) maps the colors to their corresponding values. The individual values for the two samples and the IDs/Names for these samples are displayed when the mouse is hovered over the colored square (Fig. 8 ⑤). The result files from these analyses can be downloaded by clicking the ‘Save data as a zip file’ button (Fig. 8 ⑥). The files are as follows: .mibs (identity-by-state matrix), .rel (relationship matrix), .king (KING-robust kinship coefficient matrix), and corresponding .id (Sample ID list) files.
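
The ‘Group/Cluster number’ ordering of the heatmap amounts to permuting the rows and columns of the square matrix with the same index order. A minimal Python sketch, assuming the matrix and the sample ID list have already been loaded from the PLINK output; the function name and the toy data are hypothetical:

```python
def reorder_by_group(sample_ids, groups, matrix):
    """Reorder a square sample-by-sample matrix so that samples sharing a
    group/cluster label are adjacent; the input order is kept within a group."""
    order = sorted(range(len(sample_ids)), key=lambda i: (groups[i], i))
    new_ids = [sample_ids[i] for i in order]
    # Apply the same permutation to rows and columns to keep the matrix symmetric.
    new_matrix = [[matrix[i][j] for j in order] for i in order]
    return new_ids, new_matrix


# Toy IBS matrix (hypothetical values) for three samples in two groups.
ids = ["s1", "s2", "s3"]
grp = ["TEJ", "IND", "TEJ"]
ibs = [
    [1.0, 0.6, 0.9],
    [0.6, 1.0, 0.7],
    [0.9, 0.7, 1.0],
]
new_ids, new_ibs = reorder_by_group(ids, grp, ibs)
```

After reordering, the two TEJ samples sit next to each other, so blocks of high within-group similarity become visible along the diagonal of the heatmap.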

Supplementary Fig. S6 shows a heatmap of the GRM (Fig. S6 ①) for the same data, as shown in Fig. 8 (Run A in Table 2). The samples are also ordered by ‘Group/Cluster number’ (Fig. S6 ②).

Scientific literature suggests that pruning data based on LD values is an important step for IBS and GRM calculations (see Discussion section). To illustrate how the values of IBS and GRM depend on the variants used in the analysis, we ran our pipeline on the same input subset (Run A in Table 2) without using LD-based pruning and with the --maf parameter of 0.01 (Run B in Table 2). The resulting IBS matrix (Supplementary Fig. S7 ①), reordered by the ‘Group/Cluster number’ (Fig. S7 ②), is shown in Fig. S7. As can be observed from the two heatmaps (Fig. 8 and Fig. S7), without LD-based pruning, the overall values of the IBS matrix are higher, as in the example of the same pair of samples (Fig. 8 ⑤ and Fig. S7 ③).

Unlike the calculation of IBS and GRM, LD-based pruning is not recommended for estimating KING-robust kinship coefficients (see Discussion section). Supplementary Fig. S8 shows an example of a heatmap of KING-robust kinship coefficients (Fig. S8 ①) for data that were not pruned for LD, and the --maf value was set to 0.01 (Run B in Table 2). The samples on the heatmap were ordered by ‘Group/Cluster number’ (Fig. S8 ②). Note that the KING kinship coefficients are scaled so that duplicate samples have a kinship of 0.5, rather than 1 (Fig. S8 ③). In this heatmap, most of the individual pairs had a kinship coefficient of 0, and only a few pairs had a kinship coefficient > 0.25, such as a pair of accessions of the Koshihikari variety with a KING kinship coefficient of 0.358 (Fig. S8 ④). As described above, to explore CR between individuals, we prepared GRM and the kinship matrix with pairwise KING kinship coefficients, so that users can choose between them depending on the objectives of the study and the type of downstream analysis.
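
To make the scaling of KING-robust kinship coefficients concrete, the sketch below maps a coefficient to a coarse relationship class. The cut-offs (0.354, 0.177, 0.0884, and 0.0442) are the inference criteria given in the KING documentation [12], not values defined by PSReliP, and they were derived for outbred populations, so their interpretation for inbred crop accessions should be treated with caution. The function name is hypothetical.

```python
def classify_king_kinship(phi):
    """Map a KING-robust kinship coefficient to a coarse relationship class.

    Cut-offs follow the KING documentation; they are an assumption here and
    are not part of the pipeline described in the text.
    """
    if phi > 0.354:
        return "duplicate/MZ twin"
    if phi > 0.177:
        return "1st-degree"
    if phi > 0.0884:
        return "2nd-degree"
    if phi > 0.0442:
        return "3rd-degree"
    return "unrelated or more distant"


# The pair of Koshihikari accessions reported above (kinship 0.358).
koshihikari_pair = classify_king_kinship(0.358)
```

Under these cut-offs, the Koshihikari pair falls into the duplicate class, which fits two accessions of the same variety.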

Analysis of genetic variants of Malawi cichlids

In this subsection, we present the results of the analysis of the dataset containing the genetic variants of Malawi cichlids. These results can be viewed in the following tabs: ‘Basic statistics,’ ‘Population Stratification analysis,’ and ‘Wright's FST estimates.’ We ran our pipeline multiple times using different filtering and pruning options (Table 2, dataset of genetic variants of Malawi cichlids). The parameters used in Runs F and G differed in the number of groups, whereas the parameters used in Run H differed from those used in Run I in the use of LD-based pruning and in the value of the --maf parameter.

In the stacked bar chart of a ‘Sample variant-count report’ for the original dataset (Run F in Table 2), most of the observed variants belonged to the class of ‘Hom-REF genotype’ (homozygous reference allele; reference: M_zebra_UMD2a) (blue color), and only a small number of observed variants belonged to other classes, such as ‘Hom-ALT SNP’ (orange), ‘Het. SNP genotype’ (green), and ‘diploid non-SNP variant’ (red) (Fig. S9). This result is consistent with that observed in a previous study showing that the genetic diversity in cichlid fish species is low [37]. Ten samples had more variants outside the ‘Hom-REF genotype’ class than the remaining samples. All were samples of the outgroup Astatotilapia species.

In the “Basic statistics” tab, the inbreeding coefficient (F) of each sample estimated based on the expected and observed individual heterozygosity can be shown in the grouped bar chart and line plot (Fig. S10, Run F in Table 2). These multiple subplots can be displayed by selecting the “Method-of-moments F coefficient estimates” report for the “After filtering” dataset on the “Basic statistics” tab. For most samples, the observed number of heterozygous genotypes was significantly lower than the expected number of heterozygous genotypes, and these values were approximately equal in only a few samples (Fig. S10). Conversely, the observed number of homozygous genotypes was higher than expected in most samples. The low expected heterozygosity found in this analysis indicates high homozygosity and low genetic diversity in Malawi cichlids [37].
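
The method-of-moments F coefficient reported by PLINK compares the observed and expected numbers of homozygous genotypes for each sample, F = (O(hom) − E(hom)) / (N − E(hom)), where N is the number of non-missing genotypes. A minimal Python sketch with hypothetical counts (the function name and the numbers are for illustration only):

```python
def method_of_moments_f(observed_hom, expected_hom, n_genotypes):
    """F = (O(hom) - E(hom)) / (N - E(hom)) for one sample.

    observed_hom: observed number of homozygous genotypes.
    expected_hom: expected number of homozygous genotypes.
    n_genotypes:  number of non-missing genotypes for the sample.
    """
    return (observed_hom - expected_hom) / (n_genotypes - expected_hom)


# Hypothetical samples with 1,000 non-missing genotypes each.
f_excess_hom = method_of_moments_f(observed_hom=900, expected_hom=800.0, n_genotypes=1000)
f_as_expected = method_of_moments_f(observed_hom=800, expected_hom=800.0, n_genotypes=1000)
```

A sample with more homozygous genotypes than expected gets a positive F (0.5 in the first toy case), whereas a sample that matches expectation gets F = 0.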

Given the information on sampling locations, we divided the samples into 17 groups according to their geographic locations (see Supplementary Table 5 for details). We also grouped the samples into seven eco-morphological groups, as described by Malinsky et al. [37], plus an additional outgroup, “Astatotilapia” (see Supplementary Table 5 for details). By applying these two sets of groups, we analyzed the same input dataset using the same filtering and pruning parameters. The PCA plots for PC1 and PC2 obtained from these runs are shown in Supplementary Fig. S11 (Run F in Table 2, 17 groups) and Supplementary Fig. S12 (Run G in Table 2, eight groups). The groups separated from each other by the first and second PCs correlated well with the eco-morphological groups indicated by colors (Fig. S12). In contrast, positions in the PCA plot and sampling locations were not correlated, possibly because the species used in the case study belonged to different genera and were genetically diverged despite living in the same region (Fig. S11). However, when examining individuals only from the species Astatotilapia calliptera of the genus Astatotilapia, the PCA plot showed some association between genetic similarity among these individuals and sampling locations (Supplementary Fig. S13). The PCA plot shown in Fig. S13 is an enlarged view of the region indicated by an oval in the upper-right corner of Fig. S11. All samples from this region belonged to the Astatotilapia calliptera species, as shown in Fig. S12 (indicated by ①).

It is interesting to note that individuals from the A. calliptera species group from the Lake Malawi catchment (green) are closer to individuals from the mbuna group (red) compared to those from the “outgroup Astatotilapia” (orange) that belonged to the same species of A. calliptera but were sampled from outside Lake Malawi (Fig. S12). This is in agreement with the observations of Malinsky et al. [37].

We ran PSReliP on the samples without “outgroup Astatotilapia” (Fig. S12 ②) and compared the results of PCA with those of Malinsky et al. [37] (Fig. 9, Run H in Table 2).

The results were in agreement in terms of the distribution of groups relative to each other and to both axes of the PCs, the values of eigenvectors, and the percentage of variance explained by each component. For example, PC1 explained 9.7% of the variance in our case (7.9% in Malinsky et al. [37]), whereas PC2 explained 4.2% of the variance in both cases. We also created pairwise PCA plots of the top 3–10 PCs for 109 accessions of Malawi cichlids (Supplementary Fig. S14 a-d) and found that our results were similar to those reported by Malinsky et al. [37].

A comparison of the FST values obtained from the two runs (Runs H and I in Table 2) with those presented by Malinsky et al. [37] showed that the FST values between the A. calliptera group and other groups shown in Malinsky et al. [37] were between the values obtained from the two runs (Table 4). Regarding the FST values between other groups, the values presented by Malinsky et al. [37] were slightly lower than the values we obtained for the data pruned for LD (Run H in Table 2).

Table 4

Pairwise Hudson's FST between groups in the dataset of genetic variants of Malawi cichlids
POP1	POP2	HUDSON_FST (not LD pruned)^a	HUDSON_FST (LD-pruned variant set)^b
Astatotilapia_calliptera	deep_benthic	0.319	0.216
Astatotilapia_calliptera	Diplotaxodon	0.416	0.338
Astatotilapia_calliptera	mbuna	0.278	0.217
Astatotilapia_calliptera	Rhamphochromis	0.507	0.451
Astatotilapia_calliptera	shallow_benthic	0.308	0.197
Astatotilapia_calliptera	utaka	0.350	0.250
deep_benthic	Diplotaxodon	0.283	0.241
deep_benthic	mbuna	0.262	0.230
deep_benthic	Rhamphochromis	0.416	0.359
deep_benthic	shallow_benthic	0.103	0.081
deep_benthic	utaka	0.135	0.106
Diplotaxodon	mbuna	0.354	0.317
Diplotaxodon	Rhamphochromis	0.477	0.402
Diplotaxodon	shallow_benthic	0.300	0.261
Diplotaxodon	utaka	0.312	0.260
mbuna	Rhamphochromis	0.461	0.425
mbuna	shallow_benthic	0.258	0.224
mbuna	utaka	0.290	0.253
Rhamphochromis	shallow_benthic	0.417	0.360
Rhamphochromis	utaka	0.456	0.390
shallow_benthic	utaka	0.143	0.118
Note: ^a Data were not pruned for LD and the --maf value was set to 0.01 (Run I in Table 2); ^b The parameters used are listed in Table 2 (Run H).

Thus, we conclude that the results obtained by our pipeline are consistent with those shown in the original study, which confirms the ability of our pipeline to perform reliable analyses.

Discussion

Understanding PS and CR is important in many application areas, including population genetics research, GWAS, and GS. There are several examples in the literature in which tools or pipelines have been created to visualize PS and/or CR after performing appropriate analyses based on genetic variant data. Steinig et al. [40] developed “NETVIEW P”, a comprehensive implementation of NETVIEW (the network analysis and visualization pipeline, Neuditschko et al. [41]) in Python. NETVIEW P combines data QC with the construction of population networks that can efficiently visualize the genetic structure within and between populations, including relationships and structure at the family level. In the NETVIEW P tool, the parameters and options can be set via the command line, and the input formats are the PED and MAP files from the PLINK software or a simple SNP matrix. The final network visualizations are based on the layouts provided by Cytoscape (https://cytoscape.org/), and the final network files are loaded into a compatible visualization platform.

Another example of such tools is “KinVis” (Ullah et al. [42]), which was designed to analyze GWAS input data to identify relatedness. The KinVis tool was developed as an R-Shiny application; in this respect, it is similar to the implementation of the visualization stage of our pipeline. However, KinVis differs from our pipeline in terms of the types of analyses performed, their implementation, and the functions offered to the users.

In contrast to the tools mentioned above, in our pipeline, QC and a wide range of analyses such as PCA, MDS, Wright's FST estimation, calculation of IBS and GRM, inbreeding and kinship coefficient estimation, and some other calculations for the analysis process are executed in a single workflow using PLINK software as well as the in-house shell scripts and Perl programs for data pipelining.

However, combining such diverse analyses in one workflow has its own challenges because different filtering criteria and, accordingly, different sets of genetic variants are considered optimal for different analyses. For example, for some analyses used in GWAS, such as IBD estimation, inbreeding coefficient estimation (F), and PCA, better results can be obtained by selecting and analyzing markers that are not in LD with each other (Malomane et al. [43], [44]). Hence, LD-based pruning is effective in these types of analyses.

However, LD-based pruning is not recommended for some kinship estimation methods, such as the estimation of KING-robust kinship (Manichaikul et al. [13]).

Regarding FST estimation, an inappropriate choice of criteria for the selection of genetic variants can lead to different FST values, particularly in cases where the population harbors a large number of rare variants (Bhatia et al. [22]), and LD-based pruned data underestimates FST values (Malomane et al. [43]).

With our pipeline, users can set criteria for filtering and pruning samples and genetic variants by modifying parameters and performing the analysis multiple times, thereby overcoming these challenges.

Visualization of the results of the various analyses is also performed by a single interactive web application implemented in R using the Shiny, Plotly, and other packages. Using the interactive features of Plotly inside a Shiny app allows researchers and developers to quickly create various visualizations commonly used in bioinformatics, such as dendrograms, heatmaps, Manhattan, volcano plots, etc., and publish or share them as an interactive web application.

The visualization stage of our pipeline allows users to view detailed analysis results in a web browser in the form of interactive tables, plots, and charts, which helps them quickly understand and interpret their data and decide which approaches are best for downstream analysis.

Conclusions

In this study, we developed a computational and visualization pipeline that enables users to infer PS and estimate CR at high speeds and to visualize processed input and output data interactively. To build the pipeline, existing software and R packages, such as PLINK, Shiny, Plotly, and others, were used together with our self-written programs and scripts. In addition, PSReliP provides various parameters for the analysis and visualization processes. Therefore, it is expected that by changing the parameters and repeating the corresponding analysis, users can more easily select a suitable set of variants and samples and the most appropriate PCA results and kinship coefficients for downstream analyses, including GWAS and GS. To facilitate this process, PSReliP provides the functionality to download all the original PLINK results as ZIP files, in addition to the ability to download tables, plots, and charts of the analyzed data as image files. To validate PSReliP, investigate its performance, and illustrate its various features, we conducted case studies on rice and Malawi cichlid accessions. The findings from these case studies demonstrate the ability of the proposed pipeline to correctly estimate PS and CR in the datasets provided. We hope that this pipeline, designed as an integrated platform for data analysis and visualization, will become a useful tool for analyzing genome-wide genetic variant data (single-nucleotide polymorphisms and small insertions and deletions) to identify PS and CR and help avoid the problems they may cause in further analysis.

BCF: Binary Variant Call Format

CR: Cryptic Relatedness

FST: Fixation Index

GRM: Genomic Relationship Matrix

GS: Genomic Selection

GWAS: Genome-Wide Association Studies

IBS: Identity-By-State

IBD: Identity-By-Descent

LD: Linkage Disequilibrium

MDS: Multidimensional Scaling

PCA: Principal Components Analysis

PCs: Principal Components

PS: Population Structure

VCF: Variant Call Format

Availability and requirements

Project name: PSReliP

Project home page: https://github.com/solelena/PSReliP

Operating system(s): Unix (Linux)

Programming language: Bash, R, Perl

Tool: PLINK 1.9: 19 Oct 2020 or later, PLINK 2.0: 8 Jun 2021 or later

R and R packages: R (3.6+), shiny (1.4.0.2+), plotly (4.9.2.1+), manhattanly (0.2.0+), heatmaply (1.1.0+), ggplot2 (3.3.0+), DT (0.16+), stringr (1.4.0)

Web browsers: Google Chrome, Mozilla Firefox, and Microsoft Edge

License: GNU GPL v3.0

Any restrictions to use by non-academics: license needed

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Availability of data and materials

The dataset of rice varieties and the dataset of Malawi cichlids used as input for PSReliP are available in the European Nucleotide Archive (ENA). Accessions from the BioProject, BioSample, and ENA databases are listed in Supplementary Tables 2 and 3.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

ES and HS designed the pipeline. ES implemented the pipeline, performed the analyses, and drafted the manuscript. HS supervised this study and revised the manuscript.

Acknowledgements

This study was supported by a grant from the Ministry of Agriculture, Forestry and Fisheries of Japan [Smart-breeding system for Innovative Agriculture] (BAC1001).

Astle W, Balding DJ. Population structure and cryptic relatedness in genetic association studies. Stat Sci 2009;24(4):451–71. doi:10.1214/09-STS307.
Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 2006;38(8):904–9. doi:10.1038/ng1847.
Chang C. PLINK 1.90 beta. 2022. https://www.cog-genomics.org/plink/1.9/. Accessed 9 Feb 2022.
Chang C. PLINK 2.00 alpha. 2022. https://www.cog-genomics.org/plink/2.0/. Accessed 9 Feb 2022.
Westlake University: Yang Lab. GCTA: a tool for Genome-wide Complex Trait Analysis. https://yanglab.westlake.edu.cn/software/gcta/#Overview (2021). Accessed 9 Feb 2022.
Yang J, Lee SH, Goddard ME, Visscher PM. GCTA: A tool for genome-wide complex trait analysis. Am J Hum Genet 2011;88(1):76–82. doi:10.1016/j.ajhg.2010.11.011.
Hellwege JN, Keaton JM, Giri A, Gao X, Velez Edwards DRV, Edwards TL. Population stratification in genetic association studies. Curr Protoc Hum Genet 2017;95(1):1.22.1–1.22.23. doi:10.1002/cphg.48.
da Silva Linge C, Cai L, Fu W, Clark J, Worthington M, Rawandoozi Z, Byrne DH, Gasic K. Multi-locus genome-wide association studies reveal fruit quality hotspots in peach genome. Front Plant Sci 2021;12:644799. doi:10.3389/fpls.2021.644799.
Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, et al. PLINK: A tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 2007;81(3):559–75. doi:10.1086/519795.
Speed D, Balding DJ. Relatedness in the post-genomic era: is it still useful? Nat Rev Genet 2015;16(1):33–44. doi:10.1038/nrg3821.
Goudet J, Kay T, Weir BS. How to estimate kinship. Mol Ecol 2018;27(20):4121–35. doi:10.1111/mec.14833.
Chen WM. KING tutorial: relationship inference. In: KING: Kinship-Based INference for Gwas. 2021. https://www.kingrelatedness.com/manual.shtml. Accessed 9 Feb 2022.
Manichaikul A, Mychaleckyj JC, Rich SS, Daly K, Sale M, Chen WM. Robust relationship inference in genome-wide association studies. Bioinformatics 2010;26(22):2867–73. doi:10.1093/bioinformatics/btq559.
Kang HM, Sul JH, Service SK, Zaitlen NA, Kong SY, Freimer NB, Sabatti C, Eskin E. Variance component model to account for sample structure in genome-wide association studies. Nat Genet 2010;42(4):348–54. doi:10.1038/ng.548.
Li GX, Zhu HJ. Genetic studies: the linear mixed models in genome-wide association studies. TOBIOIJ 2013;7(1):27–33. doi:10.2174/1875036201307010027.
Price AL, Zaitlen NA, Reich D, Patterson N. New approaches to population stratification in genome-wide association studies. Nat Rev Genet 2010;11(7):459–63. doi:10.1038/nrg2813.
Yu J, Pressoir G, Briggs WH, Vroh Bi IV, Yamasaki M, Doebley JF, et al. A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nat Genet 2006;38(2):203–8. doi:10.1038/ng1702.
Windhausen VS, Atlin GN, Hickey JM, Crossa J, Jannink JL, Sorrells ME, et al. Effectiveness of genomic prediction of maize hybrid performance in different breeding populations and environments. G3 (Bethesda) 2012;2(11):1427–36. doi:10.1534/g3.112.003699.
Habier D, Fernando RL, Dekkers JCM. The impact of genetic relationship information on genome-assisted breeding values. Genetics 2007;177(4):2389–97. doi:10.1534/genetics.107.081190.
Werner CR, Gaynor RC, Gorjanc G, Hickey JM, Kox T, Abbadi A et al. How population structure impacts genomic selection accuracy in cross-validation: implications for practical breeding. Front Plant Sci 2020;11:592977. doi:10.3389/fpls.2020.592977.
Holsinger KE, Weir BS. Genetics in geographically structured populations: defining, estimating and interpreting F(ST). Nat Rev Genet 2009;10(9):639–50. doi:10.1038/nrg2611.
Bhatia G, Patterson N, Sankararaman S, Price AL. Estimating and interpreting FST: the impact of rare variants. Genome Res 2013;23(9):1514–21. doi:10.1101/gr.154831.113.
Weir BS, Cockerham CC. Estimating F-statistics for the analysis of population structure. Evolution 1984;38(6):1358–70. doi:10.1111/j.1558-5646.1984.tb05657.x.
Ochoa A, Storey JD. Estimating FST and kinship for arbitrary population structures. PLOS Genet 2021;17(1):e1009241. doi:10.1371/journal.pgen.1009241.
Leutenegger AL, Prum B, Génin E, Verny C, Lemainque A, Clerget-Darpoux F, Thompson EA. Estimation of the inbreeding coefficient through use of genomic data. Am J Hum Genet 2003;73(3):516–23. doi:10.1086/378207.
Rousset F. Inbreeding and relatedness coefficients: what do they measure? Heredity 2002;88(5):371 – 80. doi:10.1038/sj.hdy.6800065.
Chang CC, Chow CC, Tellier LC, Vattikuti S, Purcell SM, Lee JJ. Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience 2015;4:7. doi:10.1186/s13742-015-0047-8.
Jia L, Yao W, Jiang Y, Li Y, Wang Z, Li H, et al. Development of interactive biological web applications with R/Shiny. Brief Bioinform 2022;23(1):bbab415. doi:10.1093/bib/bbab415.
Nusrat S, Harbig T, Gehlenborg N. Tasks, techniques, and tools for genomic data visualization. Comput Graph Forum 2019;38(3):781–805. doi:10.1111/cgf.13727.
RStudio, PBC: Shiny. https://www.rstudio.com/products/shiny/ (2022). Accessed 9 Feb 2022.
RStudio, PBC: Shiny from RStudio. https://shiny.rstudio.com/ (2020). Accessed 9 Feb 2022.
Plotly. Plotly R Open source graphing Library. https://plotly.com/r/ (2022). Accessed 9 Feb 2022.
Wang X, Tilford C, Neuhaus I, Mintier G, Guo Q, Feder JN, Kirov S. CRISPR-DAV: CRISPR NGS data analysis and visualization pipeline. Bioinformatics 2017;33(23):3811–12. doi:10.1093/bioinformatics/btx518.
Buza TM, Tonui T, Stomeo F, Tiambo C, Katani R, Schilling M, et al. iMAP: an integrated bioinformatics and visualization pipeline for microbiome data analysis. BMC Bioinformatics 2019;20(1):374. doi:10.1186/s12859-019-2965-4.
National Agriculture and Food Research Organization: Rice Annotation Project Database (RAP-DB). https://rapdb.dna.affrc.go.jp (2017). Accessed 9 Feb 2022.
Sakai H, Lee SS, Tanaka T, Numa H, Kim J, Kawahara Y, et al. Rice annotation project database (RAP-DB): an integrative and interactive database for rice genomics. Plant Cell Physiol 2013;54(2):e6. doi:10.1093/pcp/pcs183.
Malinsky M, Svardal H, Tyers AM, Miska EA, Genner MJ, Turner GF, Durbin R. Whole-genome sequences of Malawi cichlids reveal multiple radiations interconnected by gene flow. Nat Ecol Evol 2018;2(12):1940–55. doi:10.1038/s41559-018-0717-x.
Kawahara Y, de la Bastide M, Hamilton JP, Kanamori H, McCombie WR, Ouyang S, et al. Improvement of the Oryza sativa Nipponbare reference genome using next generation sequence and optical map data. Rice (N Y) 2013;6(1):4. doi:10.1186/1939-8433-6-4.
Conte MA, Kocher TD. An improved genome reference for the African cichlid, Metriaclima zebra. BMC Genomics 2015;16(1):724. doi:10.1186/s12864-015-1930-5.
Steinig EJ, Neuditschko M, Khatkar MS, Raadsma HW, Zenger KR. Netview p: a network visualization tool to unravel complex population structure using genome-wide SNPs. Mol Ecol Resour 2016;16(1):216–27. doi:10.1111/1755-0998.12442.
Neuditschko M, Khatkar MS, Raadsma HW. NetView: a high-definition network-visualization approach to detect fine-scale population structures from genome-wide patterns of variation. PLOS ONE 2012;7(10):e48375. doi:10.1371/journal.pone.0048375.
Ullah E, Aupetit M, Das A, Patil A, Al Muftah NA, Rawi R, Saad M, Bensmail H. KinVis: a visualization tool to detect cryptic relatedness in genetic datasets. Bioinformatics 2019;35(15):2683–85. doi:10.1093/bioinformatics/bty1028.
Malomane DK, Reimer C, Weigend S, Weigend A, Sharifi AR, Simianer H. Efficiency of different strategies to mitigate ascertainment bias when using SNP panels in diversity studies. BMC Genomics 2018;19(1):22. doi:10.1186/s12864-017-4416-9.
Double Helix Inc, The Golden Helix Blog: Determining the best LD Pruning options. http://blog.goldenhelix.com/jbartole/determining-best-ld-pruning-options/ (2016). Accessed 9 Feb 2022.

Table 2 is available in the Supplementary Files section.

No competing interests reported.


Editorial decision: Major revision (10 Nov, 2022)
Reviews received at journal (04 Nov, 2022)
Reviewers agreed at journal (24 Oct, 2022)
Reviewers invited by journal (23 Oct, 2022)
Editor assigned by journal (20 Oct, 2022)
Editor invited by journal (13 Oct, 2022)
Submission checks completed at journal (13 Oct, 2022)
First submitted to journal (27 Sep, 2022)
