Genome properties of key oil palm (Elaeis guineensis Jacq.) breeding populations

A good knowledge of the genome properties of the populations makes it possible to optimize breeding methods, in particular genomic selection (GS). In oil palm (Elaeis guineensis Jacq), the world’s main source of vegetable oil, this would provide insight into the promising GS results obtained so far. The present study considered two complex breeding populations, Deli and La Mé, with 943 individuals and 7324 single-nucleotide polymorphisms (SNPs) from genotyping-by-sequencing. Linkage disequilibrium (LD), haplotype sharing, effective size (Ne), and fixation index (Fst) were investigated. A genetic linkage map spanning 1778.52 cM and with a recombination rate of 2.85 cM/Mbp was constructed. The LD at r2=0.3, considered the minimum to get reliable GS results, spanned over 1.05 cM/0.22 Mbp in Deli and 0.9 cM/0.21 Mbp in La Mé. The significant degree of differentiation existing between Deli and La Mé was confirmed by the high Fst value (0.53), the pattern of correlation of SNP heterozygosity and allele frequency among populations, and the decrease of persistence of LD and of haplotype sharing among populations with increasing SNP distance. However, the level of resemblance between the two populations over short genomic distances (correlation of r values between populations >0.6 for SNPs separated by <0.5 cM/1 kbp and percentage of common haplotypes >40% for haplotypes <3600 bp/0.20 cM) likely explains the superiority of GS models ignoring the parental origin of marker alleles over models taking this information into account. The two populations had low Ne (<5). Population-specific genetic maps and reference genomes are recommended for future studies.


Introduction
The African oil palm (Elaeis guineensis Jacq.) is a perennial tropical monocot oil-producing plant that belongs to the Arecaceae family. It originated from the Gulf of Guinea. It is naturally cross-pollinated, monoecious, allogamous, and diploid, with a chromosome number of 2n = 2x = 32 and having a genome sequence of 1.8 gigabases (Ithnin and Din 2020). The economic life span of oil palm ranges from 25 to 30 years and it is mainly cultivated in humid tropical zones of the world (Barcelos et al. 2015).
The total world vegetable oil production is currently around 200 million metric tons (MT), led by oil palm (37.5%), followed by soybean oil (30%), rapeseed oil (14%), and sunflower oil (9.5%) (Statista 2021). The world demand for oil palm is expected to reach 240 million tons by 2050 (Corley 2009). Oil palm produces an average oil yield of 4 tons per hectare every year, which is approximately 7-10 times higher than soybean (Babu and Mathur 2016;Corley and Tinker 2016;Pirker et al. 2016). Oil palm is an important source of edible oil with over 80% of the products used in the food industry (cooking/frying oil, shortenings, margarine, and confectionery fats), and the rest used in the chemical industry for the formulation of soaps and detergents, pharmaceutical products, cosmetics, biodiesel, etc. (Basiron 2007;Corley 2009;Soh et al. 2017).
Nowadays, the cultivation of oil palm relies on hybrid varieties because they have a high yield per hectare. Group A and group B are the two heterotic groups involved in the development of hybrid cultivars of African oil palm (Nyouma et al. 2019). Group A mostly consists of the Deli parental population, which is derived from four individuals from an unknown area of Africa planted in 1848 in Indonesia (Hartley 1988). The selection of the Deli population, mainly for yield, started in the early twentieth century. Group B is made up of several African breeding populations. African populations resulted from a limited number of founders collected during the first half of the twentieth century. La Mé population originated from individuals collected in the Bingerville region of the Ivory Coast between 1924 and 1930, with three founders for the La Mé individuals considered here (Cochard et al. 2009). In both A and B groups, inbreeding was commonly used, by using selfing or by mating with related selected individuals (Corley and Tinker 2016).
Despite its wide adaptation and importance, oil palm production and productivity are generally far from their potential due to biotic and abiotic practical constraints. Climate change, land shortage, labor shortage, and diseases (in particular vascular wilt, ganoderma, and bud rot) are among the major factors hindering the current and future yield of oil palm across the world (Corley 2009;Paterson et al. 2013; Barcelos et al. 2015;Kwong et al. 2016;Pirker et al. 2016). In addition, genetic improvement through the conventional method in oil palm is constrained by several factors, in particular a long breeding cycle (>15 years) and a limited number of tested individuals Jin et al. 2016;Seng et al. 2016).
To provide a solution while ensuring a sustainable future, marker-assisted breeding has recently been introduced into oil palm breeding programs (Soh et al. 2017) with, for instance, the detection of quantitative trait loci (QTLs) controlling oil yield, quality, vegetative growth, and resistance to diseases (Pootakham et al. 2015;Tisné et al. 2015;Ithnin et al. 2017;Bai et al. 2018b;Daval et al. 2021) and genomic selection (Nyouma et al. 2019).
Genomic selection (GS) is a marker-assisted selection (MAS) method with a high density of markers on the entire genome so that at least one marker is in linkage disequilibrium with each QTL (Meuwissen et al. 2001). It is the most effective MAS method to improve quantitative traits (Heffner et al. 2009). Studies on the application of GS in oil palm brought positive results. Thus, GS could improve oil palm clonal selection (Nyouma et al. 2020) and the selection of parents to use for hybrid crossings (Cros et al. 2017). Generally, GS in oil palm can enhance selection intensity and/or shorten the generation interval, thus increasing the annual genetic gain (Nyouma et al. 2019). The different studies carried out have provided a significant amount of information concerning the conditions of implementation of GS in this species. For example, in Deli and La Mé, GS has been implemented with relatively small training populations (<150) and low marker density (<2000 SNPs) (Cros et al. 2017;Nyouma et al. 2020); and models ignoring the parental origin of marker alleles were found to be more accurate than models accounting for this information (Nyouma et al. 2020(Nyouma et al. , 2022. To better understand the GS results in the populations involved, a detailed study of their genome properties would be of interest, particularly regarding linkage disequilibrium (LD), effective size (N e ), haplotype sharing, and fixation index (F st ), which are known to affect GS accuracy.
Linkage disequilibrium is defined as the non-random association of alleles at two or more loci (Weir 1979;Slatkin 2008). The concept of GS relies heavily on LD between QTLs and DNA markers, and a good knowledge of LD in the breeding population is necessary to optimize GS (Nakaya and Isobe 2012;Technow et al. 2014;Li and Kim 2015;Bejarano et al. 2018). The LD pattern is shaped by genetic factors, i.e., mutations and historical events that occurred during domestication and population formation, including natural and artificial selection, drift, migration, and nonrandom mating, as well as by non-genetic factors such as marker ascertainment bias (Flint-Garcia et al. 2003;Gupta et al. 2005;Mackay and Powell 2007;Slatkin 2008).
Effective size (N e ) is defined as the size of an idealized Wright-Fisher population that would give rise to the same extent of random genetic drift or rate of inbreeding as the actual population (Wright 1931;Caballero 1994;Falconer and Mackay 1996). Low N e leads to high rates of genetic drift and inbreeding in a population, making N e one of the major factors influencing LD, and consequently the accuracy of GS (Grattapaglia 2014). N e can be inferred from LD (Corbin et al. 2012), as there is an inverse relationship between LD and N e . For a given marker density, training population size, and trait, LD and GS prediction accuracy are higher in populations with low N e than in populations with high N e (Solberg et al. 2008;Daetwyler et al. 2010;Wientjes et al. 2013;Grattapaglia 2014;Lin et al. 2014). So far, in oil palm, N e was only estimated in the Deli population (Cros et al. 2014) and there is no information about N e for La Mé population.
Haplotypes correspond to two or more SNP alleles that tend to be inherited as a unit in the chromosome (Bernardo 2010). Haplotype sharing helps estimate the genetic resemblance between individuals and is a natural extension of identity by descent (Xu and Guan 2014). Several authors showed that the aggregation of SNPs into haplotypes can increase the prediction accuracy in animals (Calus et al. 2008;Cuyabano et al. 2014;Teissier et al. 2020) and in plant species that were allogamous or with high multiallelism (Matias et al. 2017;Ballesta et al. 2019). Also, consistency of linkage phases between QTL and marker alleles among populations is required to pool them to get a larger population for genetic studies (De Roos et al. 2009;Technow et al. 2012).
The fixation index (F st ) is the correlation between gametes chosen randomly from within the same sub-population relative to the entire population (Wright 1931;Jakobsson et al. 2013;Weir and Goudet 2017). It is used to identify loci with divergent allelic frequencies between two or more populations. It helps to understand the genetic differentiation among groups. It ranges from 0 (no correlation, i.e., gametes within sub-populations are no more similar than gametes among sub-populations) to 1 (each sub-population is fixed with a different allele). F st analysis has been used to identify regions of the genome associated with domestication and selective sweeps associated with breeding (Yan et al. 2017). It can also improve GS and genome-wide association studies (GWAS). For example, Chang et al. (2019) showed that prioritizing and weighting SNPs based on their F st values can increase the accuracy of genomic predictions by more than 5%. Yan et al. (2017) in soybean found that combining GWAS and fixation index analysis helped identify QTLs for seed weight.
The goal of this study is to characterize the genome properties of two major oil palm breeding populations, Deli and La Mé, focusing on key parameters for genomic predictions, namely LD, haplotype sharing, N e , and F st .

Plant material and experimental design
The plant material used in this experiment consisted of individuals of the Deli and La Mé populations and their hybrid crosses. It comprised 943 genotyped individuals with 423 Deli, 140 La Mé, and 380 Deli × La Mé hybrid individuals. The Deli and La Mé populations used here were complex, involving several families with varying sizes and levels of relatedness. Thus, the Deli individuals belonged to 89 families of full-sibs with a mean size of 4.8 individuals (ranging from one to 60 individuals). The La Mé individuals belonged to 24 families of full-sibs with a mean size of 5.8 individuals (ranging from one to 31 individuals). Detailed pedigree information of these two populations is known over several generations (Cros et al. 2017). The Deli × La Mé hybrid individuals were obtained crossing 67 and 63 of these Deli and La Mé individuals, respectively, according to an incomplete factorial design. The hybrid individuals belonged to 101 crosses comprising on average 3.8 individuals (ranging from one to 10). For the construction of the genetic map, all the genotyped Deli, La Mé, and Deli × La Mé individuals were used, as well as the non-genotyped individuals comprised in their pedigree, for a total of 1788 individuals. For the other parts of the study, we only used the genotyped individuals of the Deli and La Mé breeding populations. The plant material was located partly in North Sumatra, on the SOCFINDO estate (Indonesia), and partly in Benin, on the INRAB research station of Pobè.

Genotypic data
Molecular data were obtained by genotyping by sequencing (GBS) (He et al. 2014). DNA extraction, GBS, and SNP calling were performed based on the procedure described in Cros et al. (2017). The sequence data were processed using Tassel GBS version 5.2.44 (Glaubitz et al. 2014). The reference genome of Singh et al. (2013) was used for alignment with Bowtie2 software (Langmead and Salzberg 2012). Biallelic SNPs were the only variants kept. SNP data points with depth below 10 were set to missing and only SNPs with less than 50% missing data in the two breeding populations were kept. SNPs with the sum of depth per datapoint above 550,000 and SNPs with 100% heterozygote genotypes were discarded. Individuals with more than 50% missing data were removed. Finally, we obtained 7324 SNP markers, common to both breeding populations. It included 5598 SNPs located on the anchored sequences of the genome (i.e., the 16 chromosomes of Singh et al. (2013) (Table 1). The average percentage of missing data per SNP was 11% in Deli and 13% in La Mé, with median values below 5% (Supplementary Figure 1).

Construction of the genetic map
The genetic map was made using LepMAP3 software version 0.4 (Rastas 2017). First, module ParentCall2 was used to call the parental oil palm genotypes, with parameters removeNonInformative=1, to remove the non-informative markers (monomorphic or homozygous in both parents), and halfSibs=1. Second, the Filtering2 module was used to remove SNPs segregating in a non-Mendelian fashion using dataTolerance=0.001. Third, the SeparateChromosomes2 module assigned markers into linkage groups (LGs) by computing all pairwise LOD scores between markers and joined markers with LOD scores higher than the user-given parameter LodLimit, which was set to eight. Fourth, the JoinSin-gles2All module assigned single markers to the existing LGs by computing LOD scores between every single marker and markers from the existing LGs, using lodLimit=4 and iter-ate=1. Finally, OrderMarkers2 ordered the markers within each LG by maximizing the likelihood of the data given the order and using the Kosambi mapping function for the conversion of recombination frequencies into map distances (centiMorgan, cM) (Rastas 2017). To join the maps of both male and female parents, the sexAveraged argument was set to 1. The individuals that were associated with outlier values in terms of the number of crossing-overs were identified in preliminary analysis and removed before the map construction. The markers which created large gaps at the top or bottom part of the LGs were discarded according to software developer recommendations. This resulted in a genetic map were the 16 largest LGs had largely higher numbers of SNPs than the remaining LGs, which were discarded to keep a genetic map with the number of LGs corresponding to the number of chromosomes of oil palm.

Comparison of genetic and physical maps
The genetic map and physical map, showing the positions on the reference genome of Singh et al. (2013), were visualized using the R package LinkageMapView (Ouellette et al. 2018). We used MareyMap (Siberchicot et al. 2017) to plot the genetic position of the molecular markers against their physical position.

Within population linkage disequilibrium and persistence between populations
Analyses of linkage disequilibrium (LD) were performed in each breeding population using the PLINK software (Purcell et al. 2007). It computed pairwise estimates of LD by the classical measure of the squared correlation of allele frequencies at diallelic loci (r 2 ) and r. Before the computation of the r 2 , the missing data points in the Deli and La Mé individuals were imputed using Beagle5.2 (Browning et al. 2018), independently for each breeding population. For the SNPs located on the assembled parts of the genome, the r 2 values between pairs of SNPs were plotted against physical distances (Mbp). For the SNPs located on the genetic map, the r 2 values were plotted against genetic distances (cM). The LD decay was plotted up to a 0.8-Mbp distance for physical positions and 3 cM for genetic positions. The relation between the r 2 values and distances was modeled by fitting local polynomials with the functions "locpoly" and "dpill" of the R package KernSmooth 2.23 (Wand 1995), as done for example in Yamamoto et al. (2016). The persistence of LD between populations (De Roos et al. 2009) was measured by the correlation of the r measure of LD between populations given by PLINK (r LD ). r LD was computed between the two populations on the SNPs comprised in windows defined along with the genetic and physical maps, over a distance up to 90 cM and 50 Mbp, respectively. r LD values can vary from −1 to 1, with a value close to 1 indicating a similar LD pattern in the two populations for the SNPs located in the genomic window considered.

Haplotype sharing
This analysis was done with the SNP data phased with Beagle5.2 (Browning et al. 2018). Sliding windows were defined along the chromosomes and linkage groups, with an overlap of 50%. Fifteen window sizes were used for physical distances, from 10 Mbp to 100 bp, and seven window sizes were used for genetic distances, from 10 to 0.01 cM. The window sizes were considered by decreasing order and, for each window of a given window size, the list of haplotypes existing in each population was made after discarding the haplotypes with an actual length shorter than the next window size. In the end, to avoid redundancy that could result from the overlap between windows, only a single copy of the duplicated haplotypes (i.e., haplotypes identical in sequence and starting at the same position) was kept. Finally, the length of the haplotypes, the percentage of haplotypes common to the two populations, and, for the common haplotypes, their frequency in each population were computed. This analysis was done using a custom R script.

Effective size
The effective size was estimated with the LD method of Waples and Do (2008) implemented in the NeEstimator 2.1 software (Do et al. 2014). The computation was made separately in each population using the SNPs located on the genetic map and selecting the "random mating" option of the software. The confidence interval of N e values was obtained by the Jackknife method on samples.

Fixation index
Pairwise F st between Deli and La Mé was estimated according to Wright (1931), using the 7324 or 5598 SNPs available with MAF>1% and subsets of 100 random individuals per population, to avoid a bias in computing the F st values between an unequal number of genotyped individuals per population ). The F st was obtained using the SNPRelate R package (Zheng et al. 2012).

Allele and genotype frequencies
The distribution of minor allele frequency (MAF) in Deli and La Mé oil palm populations showed a reduction in the number of SNPs with the increase of MAF (Fig. 1). The average MAF was 0.09 for Deli and 0.14 for La Mé. In both populations, most SNPs had low MAF values. Thus, the percentage of SNPs with MAF <0.05 was 60.5% in Deli and 49.7% in La Mé.
The correlation of heterozygosity per SNPs between the two populations ( Fig. 3) showed that the majority of SNPs were, in one population, fixed or almost fixed (i.e., concentrated alongside the x and y axes) while, in the other population, they had a much larger level of heterozygosity.
Similarly, the correlation in the frequency of alternate alleles per SNP between populations demonstrated that most SNPs have distinct segregation patterns among populations, with SNPs largely concentrated alongside the x and y axes (Fig. 4). A large proportion of SNPs thus appeared fixed or almost fixed with the reference allele in one population (i.e., frequency of alternate allele equal or close to 0), while having a significant proportion of the alternate allele in the other population.

High-density genetic map
The process of construction of the genetic map yielded a set of 16 linkage groups (LGs). The genetic map comprised 4252 SNPs, spread over 2782 unique positions (Table 2 and Fig. 5, Supplementary Figure 2), and spanned 1778.52 cM. Even coverage of the genome was achieved, with an average mapping interval between adjacent positions of 0.67 cM and the largest gaps between adjacent positions ranging from 3.31 cM (LG11) to 6.66 cM (LG14). The size of the LGs ranged from 215.72 cM (LG1) to 64.75 cM (LG16) ( Table 2). The number of unique SNP positions mapped to each linkage group ranged from 87 (LG 14) to 358 (LG1), with a mean of 174.93 per linkage group. The biggest gap size between SNPs ranged from 3.31 cM (LG11) to 6.66 cM (LG14).

Comparison of genetic and physical maps
The physical and genetic orders were in general agreement, with a Spearman rank correlation above 0.7 for 15 LGs out of 16 (Table 2 and Fig. 6). However, upturns of large chromosome segments between the genetic map and the reference genome existed in a few cases, for example in chromosome EG51_16 (Fig. 6). Also, punctual disagreements between physical and genetic distances concerning a few SNPs appearing as outliers, i.e., far apart from the regression line, were observed in most chromosomes (Fig. 7).

Within-population linkage disequilibrium and persistence between populations
The decay of LD between pairs of SNPs according to the genetic distances is shown in Fig. 8. The LD reached high values (>0.6) for short distances between SNPs. It was higher in Deli than that in La Mé for all distances. For example, considering the r 2 value of 0.3, the corresponding distance between SNPs was 1.05 cM in Deli and 0.9 cM in La Mé (Fig. 8). The difference between the two populations was small for short distances and increased with the distance between markers. Similar trends were observed when plotting LD against physical distances (Fig. 9), although the r 2 values reached higher levels (i.e., around 0.80), as a consequence of the higher number of markers on the physical map than on the genetic map. The distance corresponding to r 2 =0.3 was 0.22 Mbp in Deli and 0.21 Mbp in La Mé. A high correlation of r values between populations (r LD ) was observed for close markers, i.e., r LD above 0.6 for SNPs separated by a distance <0.5 cM on the genetic map or <1 kbp on the physical map (Fig. 10). The r LD value decreased sharply with the distance between SNPs, and was thus divided by two before 2 cM and 5 Mbp, and became negligible at distances above 50 cM or 50 Mbp.

Haplotype sharing
The percentage of shared haplotypes between Deli and La Mé populations according to the length of the genomic window is represented in Figs. 11 and 12. A large proportion of haplotypes were common between pairs of populations when considering short distances. Thus, 50% of the haplotypes with length around 30 bp (Fig. 11) and 40% of the haplotypes with length around 3600 bp were common to the two populations, and 40% of the haplotypes with length around 0.20 cM were common to the two populations (Fig. 12). As expected, when the length of the haplotypes increased, the percentage of shared haplotypes between populations decreased. The decrease was fast, with the percentage of common haplotypes falling below 20% for haplotypes longer than 300 kbp and 2.5 cM.
The frequency of the common haplotypes coincided to some extent for short haplotypes, while the differences increased for longer haplotypes. Thus, among the common haplotypes identified with a window size of 100 bp, more than one-half (51.6%) of the ones with a frequency >90% in Deli also had a frequency >90% in La Mé. This value fell to 25% for haplotypes identified with a window size of 50 kbp and to 14% for a window size of 500 kbp.

Fixation index
The F st between Deli and La Mé was 0.53. Supplementary  Figure 3 showed the F st between the two populations at the chromosome level. Depending on the region of the genome considered, there were large variations in the F st among the two pairs of populations. Thus, several regions of the genome had high F st values (>0.6), in particular on chromosomes EG51_2, EG51_8, and EG51_13.

Discussion
In this paper, we characterized the genome properties of two key oil palm breeding populations, Deli and La Mé, using SNP data obtained by GBS. This genotyping approach, despite the higher rate of missing data and errors than SNP array and SSR, appears relevant for the present study, as indicated by previous articles on other species where genetic maps and LD profiles were obtained from GBS, for instance in coconut (Rajesh et al. 2021), hevea (de Souza et al. 2018, and tea (Niu et al. 2019). Also, the Lep-MAP3 software used here to generate the genetic map is particularly robust against missing data, as it does not rely on two-point analysis, and against genotypes of lower reliability due to low coverage sequencing (Rastas 2017).

Within-population linkage disequilibrium and persistence between populations
The pattern of LD is one of the utmost factors affecting both GWAS and GS since both methods rely on LD between markers and causal polymorphisms (Sorkheh et al. 2008;Hayes et al. 2009;Yadav et al. 2021). LD is thus one of the major factors that determine the number of markers required (Heffner et al. 2009;Lebedev et al. 2020). r 2 values of 0.3 are considered a minimum to get reliable results in GS studies and GWAS (Bejarano et al. 2018). Here, when considering the genetic distances, the r 2 value reached 0.3 with SNPs separated by around 1.05 cM in Deli and 0.9 cM in La Mé (Fig. 8). As our genetic map spanned 1778.52 cM, achieving this distance between adjacent SNPs requires around 1700 SNPs for Deli and 2000 SNPs for La Mé. When considering the physical distances, the r 2 value of 0.3 was achieved with SNPs separated by around 220 kbp in Deli and 210 kbp in La Mé (Fig. 9). As here the genome length covered by SNPs spanned 643 Mbp, achieving these distances between adjacent SNPs would take around 2900 SNPs in Deli and 3100 SNPs in La Mé, which can be considered close to the value obtained from the LD decay along with the genetic map. Considering that the goal should be to cover the whole genome and that the oil palm genome spans 1.8 gigabases (Singh et al. 2013), 10,000 SNPs would be enough to reach the r 2 value of 0.3 in the two populations studied here (as this corresponds to around 8200 SNPs in Deli and 8600 La Mé). The effect of marker density on the GS accuracy has already been evaluated on oil palm datasets comprising the populations considered here. It showed that, depending on the study and trait, the number of SNPs required to achieve maximum GS accuracy was found to range from 500 to 7000 (Cros et al. 2017;Nyouma et al. 2020). This is in agreement with the results obtained from the LD analysis. Our results also revealed that the speed and the magnitude of LD decay varied between the breeding populations. The two populations were submitted to a founding bottleneck of similar magnitude. A bottleneck increases LD and slows down the LD decline (Tenaillon et al. 2008). We can assume the higher value of LD in the Deli population in all genomic distances resulted from the fact that its history was marked by a larger number of generations of selection and inbreeding than in La Mé, with the bottleneck event in the Deli history dating back to 1848 against the 1920s in La Mé.
High correlations of r values between populations (r LD >0.6, corresponding to r LD 2 >0.25) were obtained considering the markers that were the closest from each other, i.e., with distances <0.5 cM on the genetic map or <1 kbp on the physical map. Similarly, a large proportion of haplotypes was common between Deli and La Mé when considering windows of reduced size, with >40% of haplotypes with lengths below around 3600 bp or 0.20 cM being common in the two populations. This explains the results of Nyouma et al. (2020Nyouma et al. ( , 2022, who found, using the same breeding populations and the same genotyping approach (GBS), that for GS predictions in oil palm it was better not to model the parental origin of marker alleles. The superiority of GS models ignoring the parental origin of marker alleles over models considering it does not imply a complete persistence of phases between markers and QTLs among populations. Indeed, models that consider the parental origin of marker alleles are more complex and require the estimation of more parameters, possibly reducing their predictive ability, despite their ability to better depict the genetic differences between the population. The current study and the previous results of Nyouma et al. (2020Nyouma et al. ( , 2022 indicate that the level of conservation of phases among the Deli and La Mé populations captured with the present marker density is high enough to favor models ignoring the parental origin of marker alleles. A similar conclusion was reached by Technow et al. (2012) in maize, to explain the cases where this type of model outperformed the population-specific allele models. To further investigate this aspect, it would be interesting to study the correlation of marker effects obtained by GS models between Deli and La Mé populations, as done for maize in Technow et al. (2014). To our knowledge, this is the first study investigating the persistence of LD and phases between oil palm populations.
Other studies investigated the pattern of LD in oil palm, in particular Teh et al. (2016) and Kwong et al. (2016), using high-density SNP arrays. However, the results are difficult to compare, as the studies involved different populations, in particular inter-group hybrids, against parental populations in our study. However, Kwong et al. (2016) included in their work two breeding populations, JL×DA and GM×DA, that were mostly of Deli origin. Their LD value decreased by 50% from around 25 to 200 kbp, i.e., in the same range as the value found in our study (around 175 kbp). A previous study considered the same breeding populations as in the present study but used SSR markers (Cochard 2008). The results were however in agreement, with Deli having the highest LD values. The consistency of these results shows that GBS is a suitable approach for LD studies, despite a higher rate of missing values (Supplementary Figure 1) and genotyping errors compared to SNP arrays and SSR, while providing much higher marker density than SSRs.

Genetic differentiation between Deli and La Mé
The F st study, the correlation of heterozygosity per SNP, the correlation of frequency of alternate allele per SNP, and the decrease of persistence of LD and of haplotype sharing with increasing SNP distance showed a significant degree of differentiation among the two oil palm breeding populations. Reciprocal recurrent selection (RRS) certainly contributed to this differentiation, as suggested by results obtained in the heterotic groups used in maize RRS. Thus, the Iowa stiff stalk and corn borer synthetic populations diverged along with RRS cycles, with the fixation of different alleles and a steady increase in F st (Labate et al. 1999;Gerke et al. 2015), and large differences in allele frequencies were also found between Dent and Flint populations (Technow et al. 2014). Our results are also in agreement with the oil palm study of Cochard et al. (2009), who concluded that the Deli population derived from a group comprising Benin, Nigeria, Cameroon, Congo, and Angola populations, while the populations west of Benin were genetically more different from those of Deli. This supports the idea that the four founders of the Deli population were collected in Central Africa rather than in West Africa (Cochard et al. 2009). The variation found in the F st profile, which reached high values (>0.6) in some genomic regions, suggests that F st is likely to be of interest for studying signatures of selection. This could help identify candidate genes, especially for traits with contrasting phenotypic values between breeding populations, such as bunch number and bunch weight between A and B groups. However, a higher SNP density seems necessary to obtain clearer profiles with more pronounced peaks (Porto-Neto et al. 2013) that could be linked to genes of interest based on the available information on oil palm annotation.

Effective size
To our knowledge, there was so far no estimate available of N e for the La Mé breeding population. The small values obtained here for the Deli and La Mé populations are not surprising given their history, with a small number of founders and under the effect of inbreeding. In Cros et al. (2014), N e was estimated for a subset of 104 Deli individuals from the population used here, with 16 SSR markers chosen on different linkage groups and the LD method of Waples and Do (2008). This gave a N e of 5 ± 1.1 (SD), i.e., similar to the result we obtained here. This indicates the robustness of the method against marker type and density.
The small N e values obtained here also explain the fact that GS can be implemented with small training populations and low marker density. Thus, in previous studies, GS models trained with data from only 108 Deli and 102 La Mé individuals were efficient enough to replace phenotypic selection before clonal trials (Nyouma et al. 2020) while GS accuracy plateaued with only 500 to 2000 SNPs, depending on the trait (Cros et al. 2017).

Comparison of genetic and physical maps
The construction of genetic linkage maps using SNP markers is common in oil palm (see, for instance, Jeennor and Volkaert 2014;Ting et al. 2014;Lee et al. 2015;Bai et al. 2018a, b;Gan et al. 2018;Ong et al. 2019;Herrero et al. 2020). The genetic linkage maps helped identify genomic regions having major genes and QTLs that control oil yield (Montoya et al. 2013;Jeennor and Volkaert 2014;Tisné et al. 2015), palm oil fatty acid composition (Singh et al. 2009;Montoya et al. 2013), vegetative growth (Ukoskit et al. 2014;Lee et al. 2015;Bai et al. 2018b;Teh et al. 2020), and resistance to diseases (Tisné et al. 2017;Daval et al. 2021). High-density maps were also used to improve the assembly of previously published genome sequences by assigning scaffolds originally unplaced (Ong et al. 2019. To our knowledge, the present study involved the largest number of individuals genotyped for the construction of a genetic map in oil palm. Another original aspect of our genetic map is the use of complex plant material including several families with varying degrees of relatedness, several generations, and different populations. In contrast, the previously published oil palm genetic maps were usually constructed from full-sib families (e.g., Watson et al. 2001;Cochard et al. 2009;Ting et al. 2013;Ukoskit et al. 2014), although Billotte et al. (2010) used a factorial design. To our knowledge, only Cochard et al. (2015) and Daval et al. (2021) constructed genetic maps from populations with similar levels of complexity. However, they used SSR markers and the CRI-MAP software (Green et al. 1990), which seems less efficient than LepMAP3, as it has problems handling large pedigrees with large numbers of bi-allelic markers particularly when there are lots of missing parental and grandparental genotypes. The map of our study is shorter than the map of Cochard et al. (2015), which reached 1935 cM and was obtained using a similar oil palm population. This might be a consequence of the marker type, as it was shown that SSRs led to inflated maps compared to SNPs (Ball et al. 2010).
The linkage map presented here, with an average marker density of one SNP in every 0.67 cM when considering unique positions, had a denser genome coverage compared to most previously published SNP oil palm genetic linkage maps, like Ting et al. (2014), with one marker in every 1.40 cM and Pootakham et al. (2015) with one marker in every 1.26 cM. However, our map is less dense than the genetic linkage maps constructed by Ong et al. (2019Ong et al. ( , 2020, with one marker in every 0.04 cM, 0.05 cM, and 0.18 cM, depending on the map, and Bai et al. (2018a), with one marker every 0.29 cM, and Herrero et al. (2020), with one marker in every 0.57 cM. Most of these variations in terms of the marker density of the genetic maps can be explained by differences in genotyping approaches and the size of the populations (Ferreira et al. 2006;Semagn et al. 2006;Seyum et al. 2021). Combining high-throughput genotyping and populations with at least 150 individuals appears as an efficient strategy to maximize marker density, as in Ong et al. (2019Ong et al. ( , 2020, Bai et al. (2018a), and the present study.
There were several upturns between the genetic and physical maps (Fig. 7). For example, LG 1, 2, 5, and 16 had large upturns for regions of the genome of more than 10 Mbp. Aside from potential genome assembly artifacts, this can be the consequence of genomic rearrangements between populations, as the reference genome (Eg5.1) was obtained on an individual of the AVROS oil palm population (Singh et al. 2013), which thus differed from the populations used for the genetic mapping (Deli and La Mé). To further investigate this aspect, we compared the position of our SNPs on Eg5.1 with their position on EgPMv6, a new version of Eg5.1 improved through the use of a high-density linkage map ), but that was made available after the beginning of the present study (Supplementary Figure 4). We found that, although some upturns existed (in particular for the smallest chromosomes), the positions on the two genomes are in general agreement. Consequently, there are still disagreements between the genetic positions obtained here and the physical positions, even considering the improved assembly. For example, LG 2 had 100% of its SNP located on the same chromosome according to Eg5.1 and EgPMv6 (Eg5.1_1 and GK000077.1, respectively), and almost identical SNP order between the two assemblies, while large upturns existed between the genetic and physical positions (Fig. 7). This aspect deserves further study, which could be done using population-specific genetic maps and reference genomes. This requires new data, with more genotyped individuals per population and new reference genomes.

Conclusion
The present study focused on two key populations used for hybrid breeding in oil palm, Deli and La Mé, and estimated genetic parameters affecting GS accuracy. A highdensity genetic map was constructed from a complex population including several families with varying sizes and levels of relatedness and with different genetic backgrounds. It included 4252 SNPs from GBS and spanned 1778.52 cM, with an average recombination rate of 2.85 cM/Mbp. The LD at r 2 =0.3, considered as the minimum to get reliable results for genomic predictions, spanned over 1.05 cM/0.22 Mbp in Deli and 0.9 cM/0.21 Mbp in La Mé. In the two populations, 10,000 SNPs would be enough to reach this level of LD. A high correlation of r values of LD between populations (r LD >0.6) was obtained considering the markers separated by short distances, i.e., <0.5 cM on the genetic map or <1 kbp on the physical map. The percentage of common haplotypes was above 40% for short haplotypes (3600 bp or 0.20 cM). This resemblance decreased with the distance between SNPs, with for example the percentage of common haplotypes falling below 20% for haplotypes longer than 300 kbp. The F st was high (0.53). Overall, the results showed strong genetic differentiation between Deli and La Mé, but the level of resemblance between them over short genomic distances likely explains the superiority of GS models ignoring the parental origin of marker alleles over models taking this information into account. The N e values of the two populations were small (<5). Population-specific genetic maps and reference genomes would be of interest for future studies.