Background: Linkage and linkage disequilibrium (LD) between genome regions cause dependencies among genomic markers. Because of family stratification in livestock or crop populations with non-random mating, standard measures of population LD such as r² may be biased. Grouping markers according to their interdependence therefore needs to account for the actual population structure to allow proper inference in genome-based evaluations.
Results: Given a matrix reflecting the strength of association between markers, groups are built successively using a greedy algorithm, with the largest groups built first. Optionally, a representative marker is selected for each group. We provide an implementation of the grouping approach as a new function in the R package hscovar. This package enables the calculation of the theoretical covariance between biallelic markers for half- or full-sib families and the selection of representative markers. In case studies, the number of groups comprising dependent markers was smaller, and representative SNPs were spread more uniformly over the investigated chromosome region, when family stratification was taken into account than with a population-LD approach. In a simulation study, the sensitivity and specificity of a genome-based association study improved when the selection of representative markers took family structure into account.
Conclusions: Chromosome segments that frequently recombine in the underlying population can be identified from the matrix of pairwise dependence between markers. Representative markers can be exploited, for instance, for dimension reduction prior to a genome-based association study, or the grouping structure itself can be employed in a grouped penalization approach.
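The greedy grouping outlined in the Results can be illustrated with a minimal sketch. It is not the hscovar implementation. The function name `greedy_groups`, the threshold of 0.8, the tie-breaking rule (lowest marker index wins) and the choice of the seed marker as group representative are all assumptions made for illustration. The input is any symmetric matrix of pairwise association, for example a family-based correlation matrix.

```python
def greedy_groups(assoc, threshold=0.8):
    """Greedily group markers by pairwise association (illustrative sketch).

    assoc     -- symmetric square matrix (list of lists) of association values
    threshold -- assumed cut-off on |association| for two markers to be linked
    Returns a list of (representative, members) tuples, largest group first.
    """
    unassigned = set(range(len(assoc)))
    groups = []
    while unassigned:
        best_seed, best_members = None, []
        # Find the unassigned marker linked to the most unassigned markers.
        for i in sorted(unassigned):
            members = [j for j in sorted(unassigned)
                       if j == i or abs(assoc[i][j]) >= threshold]
            # Strict '>' keeps the lowest index on ties (assumed rule).
            if len(members) > len(best_members):
                best_seed, best_members = i, members
        # The seed serves as representative marker (assumed choice).
        groups.append((best_seed, best_members))
        unassigned -= set(best_members)
    return groups


# Toy association matrix for five markers (made-up values).
example = [
    [1.00, 0.90, 0.85, 0.10, 0.00],
    [0.90, 1.00, 0.50, 0.20, 0.10],
    [0.85, 0.50, 1.00, 0.10, 0.20],
    [0.10, 0.20, 0.10, 1.00, -0.95],
    [0.00, 0.10, 0.20, -0.95, 1.00],
]
result = greedy_groups(example, threshold=0.8)
print(result)
```

Each pass rescans all unassigned markers, so the sketch favours readability over speed. The representatives it returns (the first element of each tuple) could then serve as the reduced marker set for an association study.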

The full text of this article is available to read as a PDF.
Supplementary files associated with this preprint:
Additional file 1: PDF file summarizing the derivation of family-based correlation matrices. Given the population structure (half-sib or full-sib families), the covariance matrix is derived analytically. Its computation using the R package hscovar is shown.
Additional file 2: TXT file containing R code for mouse data. Raw mouse data are processed and the matrix of correlation between markers is derived.
Additional file 3: TXT file containing R code for cattle data. Cattle data are processed and the matrix of correlation between markers is derived.
Additional file 4: TXT file containing R code for maize data. Raw maize data are processed and the matrix of correlation between markers is derived.
Additional file 5: TXT file containing R code for the simulation study. With this script, genotype and phenotype data of half-sib families are simulated and a genome-based association analysis is carried out.
Additional file 6: TXT file containing R code for creating plots. Plots of correlation matrices and population-LD matrices are produced from the results of additional files 2–5.
Additional file 7: PDF file with additional tables. Number of groups, number of groups with at least three SNPs, and Calinski-Harabasz index for different simulation scenarios and varying thresholds.
Posted 06 Jan, 2021