Oil palm (Elaeis guineensis Jacq.) is a perennial tropical monocot oil-producing plant that belongs to the Arecaceae family and originated in the Gulf of Guinea. It is naturally cross-pollinated, monoecious, allogamous, and diploid, with a chromosome number of 2n = 2x = 32 and a genome size of 1.8 gigabases (Ithnin and Din 2020). The economic life span of oil palm ranges from 25 to 30 years, and it is mainly cultivated in the humid tropical zones of the world (Barcelos et al. 2015).
Nowadays, the cultivation of oil palm relies on hybrid varieties because of their high yield per hectare. Two heterotic groups, Group A and Group B, are involved in the development of oil palm hybrid cultivars (Nyouma et al. 2019). Group A mostly consists of the Deli parental population, which is derived from four individuals from an unknown area of Africa that were planted in Indonesia in 1848 (Hartley 1988). Selection in the Deli population, mainly for yield, started in the early twentieth century. Group B is made up of several African breeding populations, each of which resulted from a limited number of founders collected during the first half of the twentieth century. Among them, the La Mé population originated from palms collected in the Bingerville region of the Ivory Coast between 1924 and 1930, with three founders for the individuals considered here (Cochard et al. 2009). In both groups A and B, inbreeding was commonly practiced, either by selfing or by mating related selected individuals (Corley and Tinker 2016).
The total world vegetable oil production is currently around 200 million metric tons (MT), led by palm oil (75 MT), followed by soybean oil (60 MT), rapeseed oil (28 MT), and sunflower oil (19 MT) (Statista 2021). The world demand for oil palm is expected to reach 240 million tons by 2050 (Corley 2009). Oil palm produces an average oil yield of 4 tons per hectare per year, which is approximately 7-10 times higher than that of soybean (Babu and Mathur 2016; Corley and Tinker 2016; Pirker et al. 2016). Oil palm is an important source of edible oil, with over 80% of its products used in the food industry (cooking/frying oil, shortenings, margarine, and confectionery fats) and the rest used in the chemical industry for the formulation of soaps and detergents, pharmaceutical products, cosmetics, biodiesel, etc. (Basiron 2007; Corley 2009; Soh et al. 2017).
Despite its wide adaptation and importance, oil palm production and productivity are generally far from their potential due to biotic and abiotic practical constraints. Climate change, land shortage, and labor shortage are among the major factors hindering the current and future yield of oil palm across the world (Corley 2009; Paterson et al. 2013; Barcelos et al. 2015; Pirker et al. 2016; Kwong et al. 2016). The constraints of the conventional methods currently used for oil palm breeding, i.e. a long breeding cycle (>15 years) and a limited number of tested individuals, also limit the current palm oil yield (Cros et al. 2015; Jin et al. 2016; Seng et al. 2016). To address these constraints while ensuring a sustainable future, marker-assisted breeding has recently been introduced into oil palm breeding programs (Soh et al. 2017).
Genomic selection (GS) is a marker-assisted selection (MAS) method that uses a high density of markers across the entire genome, so that at least one marker is in linkage disequilibrium with each quantitative trait locus (QTL) (Meuwissen et al. 2001). It is the most effective MAS method to improve quantitative traits (Heffner et al. 2009). Studies on the application of GS in oil palm have given positive results. In particular, GS could improve oil palm clonal selection (Nyouma et al. 2020) and the selection of parents to use for hybrid crosses (Cros et al. 2017). Generally, GS in oil palm can enhance selection intensity and/or shorten the generation interval, thus increasing the annual genetic gain (Nyouma et al. 2019). However, the method could still be optimized, for instance in terms of the prediction model and marker density. This should be done in light of the genome properties of the oil palm populations used in the reciprocal recurrent selection breeding scheme, in particular linkage disequilibrium (LD), effective size (Ne), haplotype sharing, and fixation index (Fst).
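The link between selection intensity, generation interval, and annual genetic gain can be made explicit with the standard breeder's equation. The formula below is a textbook expression added here for illustration, not a result of the cited studies, and the symbols are assumptions introduced for this sketch: i is the selection intensity, r the selection accuracy, \sigma_A the additive genetic standard deviation, and L the generation interval in years.

```latex
\Delta G_{\text{year}} = \frac{i \, r \, \sigma_A}{L}
```

Under this expression, raising i (more candidates screened with markers) or lowering L (selection before field evaluation is complete) increases the annual gain, provided that the accuracy r does not fall by a larger factor.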
Linkage disequilibrium is defined as the nonrandom association of alleles at two or more loci (Weir 1979; Slatkin 2008). The concept of GS relies heavily on LD between QTLs and DNA markers, and a good knowledge of LD in the breeding population is necessary to optimize GS (Nakaya and Isobe 2012; Technow et al. 2014; Li and Kim 2015; Bejarano et al. 2018). The LD pattern is shaped by genetic factors, i.e. mutations and historical events that occurred during domestication and population formation, including natural and artificial selection, drift, migration, and non-random mating, as well as by non-genetic factors such as marker ascertainment bias (Flint-Garcia et al. 2003; Gupta et al. 2005; Mackay and Powell 2007; Slatkin 2008).
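As an illustration of this definition, the following minimal sketch computes the widely used r² measure of LD between two biallelic loci. It assumes phased haplotypes coded 0/1, and the function name `ld_r2` is hypothetical; it is not the procedure of any particular study or software.

```python
import math


def ld_r2(hap_a, hap_b):
    """Return r^2 between two biallelic loci.

    hap_a and hap_b hold the alleles (0 or 1) carried by the same
    phased haplotypes at locus A and locus B, in the same order.
    """
    n = len(hap_a)
    if n == 0 or n != len(hap_b):
        raise ValueError("both loci need the same, non-zero number of haplotypes")
    p_a = sum(hap_a) / n                      # frequency of allele 1 at locus A
    p_b = sum(hap_b) / n                      # frequency of allele 1 at locus B
    p_ab = sum(a * b for a, b in zip(hap_a, hap_b)) / n  # frequency of haplotype 1-1
    d = p_ab - p_a * p_b                      # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:                            # at least one locus is monomorphic
        return math.nan
    return d * d / denom


if __name__ == "__main__":
    print(ld_r2([0, 0, 1, 1], [0, 0, 1, 1]))  # alleles always co-occur
    print(ld_r2([0, 1, 0, 1], [0, 0, 1, 1]))  # alleles combine at random
```

With unphased genotype data, haplotype frequencies would first have to be inferred, which this sketch does not attempt.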
The number of randomly mating individuals in a population that gives rise to an observed rate of inbreeding is known as the effective size (Ne) (Falconer and Mackay 1996). A lower Ne results from higher rates of inbreeding and genetic drift in a population (Lin et al. 2014), making Ne one of the major factors influencing LD and, consequently, the accuracy of GS (Grattapaglia 2014). There is an inverse relationship between LD and Ne. For a given marker density, training population size, and trait, LD and GS prediction accuracy are higher in populations with low Ne than in populations with high Ne (Solberg et al. 2008; Grattapaglia 2014; Lin et al. 2014). So far, in oil palm, Ne has only been estimated in the Deli population (Cros et al. 2014), and there is no information about Ne for the La Mé population.
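The inverse relationship between LD and Ne can be illustrated with the classical approximation of Sved (1971), E[r²] ≈ 1 / (1 + 4·Ne·c), where c is the recombination rate between two loci. This approximation is not stated in the text above; it is an assumption introduced here for illustration, and it ignores mutation and sample-size corrections. The function names are hypothetical.

```python
def expected_r2(ne, c):
    """Expected r^2 between two loci under Sved's approximation.

    ne: effective population size; c: recombination rate (in Morgans)
    between the two loci. Assumes drift-recombination equilibrium.
    """
    return 1.0 / (1.0 + 4.0 * ne * c)


def ne_from_r2(r2, c):
    """Invert the same approximation to get Ne from an observed mean r^2."""
    if not 0.0 < r2 <= 1.0 or c <= 0.0:
        raise ValueError("need 0 < r2 <= 1 and c > 0")
    return (1.0 / r2 - 1.0) / (4.0 * c)


if __name__ == "__main__":
    # For loci 1 cM apart, a smaller Ne gives a higher expected r^2.
    for ne in (10, 100, 1000):
        print(ne, expected_r2(ne, 0.01))
```

In practice, r² values are averaged over many marker pairs within distance classes before inversion, and corrections for finite sample size are usually applied; neither step is shown here.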
The fixation index (Fst) is used to identify loci with divergent allelic frequencies between two or more populations (Wright 1931). It helps to understand the genetic differentiation among groups (Jakobsson et al. 2013). It ranges from 0 (no differentiation between populations) to 1 (each population is fixed for a different allele). Fst analysis has been used to identify regions of the genome associated with domestication and selective sweeps associated with breeding (Yan et al. 2017). It can also improve GS and genome-wide association studies (GWAS). For example, Chang et al. (2019) showed that prioritizing and weighting SNPs based on their Fst values can increase the accuracy of genomic predictions by more than 5%, and Yan et al. (2017) found in soybean that combining GWAS and fixation index analysis helped to identify QTLs for seed weight.
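To make the 0 to 1 scale concrete, the sketch below computes Fst at a single biallelic locus as (H_T − H_S) / H_T, where H_S is the mean expected heterozygosity within populations and H_T the expected heterozygosity of the pooled population. It assumes equally weighted populations and known allele frequencies, which is a simplification; estimators used in practice also correct for sample size. The function name `fst_biallelic` is hypothetical.

```python
import math


def fst_biallelic(freqs):
    """Fst at one biallelic locus from per-population allele frequencies.

    freqs: frequency of the same reference allele in each population.
    Populations are given equal weight (an assumption of this sketch).
    """
    if len(freqs) < 2:
        raise ValueError("need at least two populations")
    p_bar = sum(freqs) / len(freqs)                           # pooled allele frequency
    h_s = sum(2 * p * (1 - p) for p in freqs) / len(freqs)    # mean within-population heterozygosity
    h_t = 2 * p_bar * (1 - p_bar)                             # total heterozygosity
    if h_t == 0:                                              # locus monomorphic everywhere
        return math.nan
    return (h_t - h_s) / h_t


if __name__ == "__main__":
    print(fst_biallelic([0.0, 1.0]))   # populations fixed for different alleles
    print(fst_biallelic([0.3, 0.3]))   # identical allele frequencies
    print(fst_biallelic([0.2, 0.8]))   # intermediate differentiation
```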
Haplotypes correspond to two or more SNP alleles that tend to be inherited as a unit on a chromosome (Bernardo 2010). Haplotype sharing helps to estimate the genetic resemblance between individuals and is a natural extension of identity by descent (Xu and Guan 2014). Several authors showed that the aggregation of SNPs into haplotypes can increase the prediction accuracy in animals (Calus et al. 2008; Cuyabano et al. 2014; Teissier et al. 2020) and in plant species that are allogamous or have high multiallelism (Matias et al. 2017; Ballesta et al. 2019). In addition, consistency of linkage phases between QTL and marker alleles among populations is required before populations can be pooled into a larger population for genetic studies.
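One simple way to examine the consistency of linkage phase between two populations is to compare the sign of the LD correlation r for the same marker pairs in each population. The sketch below is only an illustration under stated assumptions: phased haplotypes coded 0/1, marker pairs supplied by the user, and hypothetical function names (`signed_r`, `phase_agreement`). It is not presented as the measure used in any cited study.

```python
import math


def signed_r(hap_a, hap_b):
    """Signed LD correlation r between two biallelic loci (0/1 haplotypes)."""
    n = len(hap_a)
    p_a = sum(hap_a) / n
    p_b = sum(hap_b) / n
    p_ab = sum(a * b for a, b in zip(hap_a, hap_b)) / n
    denom = math.sqrt(p_a * (1 - p_a) * p_b * (1 - p_b))
    return (p_ab - p_a * p_b) / denom if denom > 0 else 0.0


def phase_agreement(pop1, pop2, pairs):
    """Share of marker pairs whose r has the same sign in both populations.

    pop1, pop2: lists of haplotypes, each haplotype a list of 0/1 alleles
    indexed by marker. pairs: list of (i, j) marker index pairs.
    Pairs with r == 0 in either population are treated as uninformative.
    """
    same = informative = 0
    for i, j in pairs:
        r1 = signed_r([h[i] for h in pop1], [h[j] for h in pop1])
        r2 = signed_r([h[i] for h in pop2], [h[j] for h in pop2])
        if r1 == 0 or r2 == 0:
            continue
        informative += 1
        same += (r1 > 0) == (r2 > 0)
    return same / informative if informative else math.nan


if __name__ == "__main__":
    pop1 = [[0, 0, 0], [1, 1, 1], [0, 0, 0], [1, 1, 1]]
    pop2 = [[0, 0, 1], [1, 1, 0], [0, 0, 1], [1, 1, 0]]
    print(phase_agreement(pop1, pop2, [(0, 1), (0, 2), (1, 2)]))
```

A value close to 1 would suggest that marker alleles tag the same linked alleles in both populations, whereas a value near 0.5 would suggest that phase is essentially unrelated between them.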
The goal of this study is to characterize the genome properties of two major oil palm breeding populations, Deli and La Mé, focusing on key parameters for genomic predictions, namely linkage disequilibrium (LD), haplotype sharing, effective size (Ne), and fixation index (Fst).