Study periods, geographic area, and collection strategy:
Plant materiel was collected through two collection campaigns in Conakry Guinea. The first campaign (hereafter referred to as Collect-1) took place in 1979 and 1982 (Bezançon et al., 1983). The second collection campaign (Collect-2) was undertaken in 2011 by scientists from Cirad (https://www.cirad.fr/) and IRAG (https://irag-guinee.org/). The collect area stretched from tropical forest to Sudanian savannah agro-ecological regions of Conakry Guinea: latitude 7°34'19.96"N-12°29'1.50"N; longitude 8°25'0.0"O-13°17'59.17"O (S-Figure 1).
The collect strategy was to drive along major roads and other practicable trails, stop in villages without prior notice, conduct a quick inventory of the rice varieties and collect a sample of each of the identified varieties, in the field or from the granary. The passport data for Collect-1 samples included the village name, the variety name, the species (O. sativa or O. glaberrima) and, occasionally, the rice growing ecosystem (upland, hydromorphic, lowland) and the type of collect (field/granary, bulk/panicle). Approximately 40% of the accessions were collected in the field.
To prepare the Collect-2 campaign, the geographic coordinates of a large share of the villages visited during Collect-1 was retrieved using Google-Earth software. The collect team borrowed the track of Collect-1 campaign but did not limit its collect to the Collect-1 villages, especially as some villages could not be found. The passport data for samples from Collect-2 included the village name and its geographic coordinates, the variety name and tentative species membership, the rice growing ecosystem and the type of collect. Approximately 80% of the samples were panicles collected in the field.
For Collect-1, 442 accessions were retrieved from the long-term conservation facilities of the Institute de Recherche pour le Development (https://www.ird.fr/). The Collect-2 campaign yielded 776 accessions.
Genotyping
All Collect-1 and Collect-2 accessions were first genotyped with 16 SSR markers using DNA from a plant obtained with a seed visually most representative of the seed stock of each accession. These data served to (i) ascertain the membership of accessions to one of the three genetic groups (O. sativa indica, O. sativa japonica, and O. glaberrima), (ii) detect redundant accessions, (iii) assess the pattern of genetic diversity within each genetic group, for each collect. Based on such information and logistic issues, 600 accessions were selected for genotyping by sequencing (GBS). DNA libraries were prepared at the Regional Genotyping Technology Platform (http://www.gptr-lr-genotypage.com), using the MATAB method for DNA extraction and the ApekI enzyme for DNA digestion. The libraries were single-end sequenced using Illumina HiSeq™2000 (Illumina, Inc.) at the Regional Genotyping Platform (http://get.genotoul.fr/). The Fastq sequences were aligned to the rice reference genome, Os-Nipponbare-Reference-IRGSP-1.0 with Bowtie2 (default parameters). Non-aligning sequences and sequences with multiple positions were discarded. Single nucleotide polymorphisms (SNP) were called using Tassel GBS pipeline v5.2.29 (Bradbury et al., 2007). First, the genotypic data for all accessions taken together were filtered for quality score (> 20) and bi-allelic status of SNPs. Second, the genotypic data of accessions from each genetic group were filtered separately for minor allele frequency (MAF > 1%), rate of missing data (< 20%), and heterozygosity (< 5%). The missing data were imputed, separately for each group, using Beagle v4.0. Table 1 summarizes the results of the process.
Table 1
Rice accessions from the two collect campaigns that were genotyped and/or phenotyped for the present study.
Genetic group | Number of SNP | Number of accessions genotyped | Number of accessions phenotype |
Collec-1 | Collect-2 | Total | Collect-1 | Collect-2 | Total |
O. sativa Indica (Osi) | 14,775 | 15 | 55 | 70 | 73 | 180 | 253 |
O. sativa japonica tropical (Osj) | 9,282 | 96 | 154 | 250 | 107 | 147 | 254 |
O. glaberrima (Og) | 6,250 | 111 | 99 | 210 | 59 | 85 | 144 |
Total | | 222 | 308 | 530 | 239 | 412 | 651 |
Phenotyping
Independent from the above-mentioned selection of accessions for genotyping, 239 accessions from Collect-1 and 412 accessions from Collect-2 were selected (Table 1) to be phenotyped for the duration of the sowing to heading date (DTHD). Accessions from Collect-1 were first seed increased in a greenhouse in Montpellier, France. Phenotyping of accessions collected from the upland and the lowland ecosystems was performed separately. The 362 accessions collected from the upland were cultivated in an upland field, in the IRAG research station in Guinea (Kindia, 10°00’51.37 N, 12°50’66” W), in 2012. Accessions grown in the lowland were sown twice (mid-July and mid-August) in 2013. The elementary plot for each accession was three consecutive lines of 1m long, and the DTHD was recorded on five plants of the middle line. Homogeneity of growing conditions was assessed by the systematic presence of two check varieties every 20 test accessions. There were 364 accessions in common between GBS genotyping and field phenotyping.
Assessment of genetic diversity and population parameters
Pattern of genetic diversity at the whole accession level and, separately, at the level of each genetic group was investigated using distances between individual accessions estimated by a simple-matching dissimilarity index, computed from genotypes at 1,130 SNP common to the O. sativa indica (Osi), O. sativa tropical japonica (Osj) and O. glaberrima (Og) accessions. An unweighted neighbour-joining tree was constructed based on this dissimilarity matrix, using DARwin.6 (Perrier & Jacquemond-Collet, 2006).
Population parameters were assessed for each genetic group subdivided into two subpopulations representing Collect-1 and Collect-2. This is Osi-1 and Osi-2 for the O. sativa indica, Osj-1 and Osj-2 for the O. sativa tropical japonica, and Og-1 and Og-2 for O. glaberrima. The relative importance of genetic diversity among subpopulations, among individuals within a subpopulation and within individuals, was assessed by analysis of molecular variance AMOVA with Arlequin software version 3.5.2. (Excoffier et al., 2010). The same software served to compute nucleotide diversity, expected heterozygosity per locus, the Wright (1965) F-statistics (FIS and FST) and the population parameter \(\theta =4{N}_{e}\mu\), where \({N}_{e}\) is the effective population size and µ is the overall mutation rate at the haplotype level.
The speed of decay of linkage disequilibrium (LD) in each subpopulation was estimated by computing r² between pairs of markers on a chromosome basis, using the “full matrix” option of Tassel 5.2.6 (Bradbury et al., 2007), and then by averaging the results by classes of distance.
Detection of loci under selection
Presence of loci under selection was investigated through genome scan for loci for which the extent of differentiation between subpopulations Collect-1 and Collect-2 of a given genetic group, summarized by the FST statistic, is significantly higher than the expected FST values under neutrality hypothesis, for a given average heterozygosity (h0) between populations and for a given demographic model (Beaumont & Nichols (1996). Two genome scan methods were implemented: a standard approach assuming outcrossing reproduction (Excoffier et al., 2009) and a method specifically developed for mainly autogamous species (Navascués et al., 2021).
In the Excoffier et al. (2009) approach (hereafter referred to as the “heterozygosity-based” method), first the locus-by-locus differentiation FSTi is computed as \({\widehat{F}}_{STi}= ({\widehat{f}}_{0i} – {\widehat{f}}_{1i})/(1-{\widehat{f}}_{1i})\), where \({\widehat{f}}_{0}\) is the average homozygosity within a population and \({\widehat{f}}_{1}\) is the probability that two genes from different subpopulations are identical (Weir & Cockerham (1984). Second, Global FST is computed as a weighted average among loci, where \({\widehat{F}}_{STi}\) values are weighed by the heterozygosity between populations\({h}_{1i }=1-{\widehat{f}}_{1i}\) and then the heterozygosity between populations \({\widehat{H}}_{1}\) is inferred from the average heterozygosity \({\widehat{h}}_{0}\) as \({\widehat{H}}_{1}= {\widehat{h}}_{0} /(1- {\widehat{F}}_{ST})\). Third, the joint distribution of the neutrality FST and heterozygosity is obtained by coalescent simulation under the hypothesis of hierarchical island model of population differentiation. Finally, locus-specific FST P-values are obtained from the simulated joint distribution of \({\widehat{H}}_{1}\) and \({\widehat{F}}_{ST}\) by a kernel density approach. Under the hierarchical island model, the population structure is arranged into k groups of d demes, in which migration rates between demes are different within and between groups. The mutation model implemented is the SNP mutation model, in which a mutation occurs at random along the simulated structured coalescent tree. The approach was implemented using Arlequin 3.5.2 (Excoffier et al., 2010). The value for k and d was set to 50 and 10, respectively, to mimic a large number of villages each owning less than 10 rice varieties and preferential interactions between varieties of the given village. The number of simulations was set to 50,000.
The Navascués et al. (2021) method (hereafter referred to as “drift-based” method), relies on principles first proposed by Goldringer & Bataillon (2004), that is using the estimated magnitude of drift between two time samples as a null model to test for homogeneity of differentiation across loci. In the case of complete neutrality, all sampled loci should provide estimates of genetic differentiation drawn from the same distribution. Assuming a single isolated population, this distribution depends on the strength of the genetic drift, that is, on the length of the period (t in number of generations) and on the effective population size (Ne). Loci under directional selection or linked to such loci are expected to exhibit larger differentiation values than expected under the neutral hypothesis. In order to account for the mainly selfing reproduction system, Navascués et al. (2021) developed an implementation procedure that relies not on the observed allele frequency, but on genotype (AA, Aa and aa) frequency. Indeed, in a mainly selfing species, allele copies within an individual are not independent samples from the population, while genotypes are independent samples. Moreover, in order to account for uncertainty of the initial genotype frequency, assuming the same prior probability for the three genotype frequencies, genotype frequencies in the population are sampled from the posterior probability distribution with Dir(K0 + 1), where K0 is the observed genotype counts in the sample of the focal locus at time \(t=0\).
Locus-by-locus differentiation FSTi are computed with the Weir & Cockerham (1984) method. Temporal differentiation is measured by estimating the FST with the analysis of variance approach proposed by Weir & Cockerham (1984). Then, the temporal FST is used to estimate Ne as: \({\widehat{N}}_{e}= \tau (1-{\widehat{F}}_{ST})/4{\widehat{F}}_{ST}\), where τ is the number of generations between the two temporal populations. The null distribution of single locus FSTi is built through simulations of drift. Each of these simulations consists in (i) drawing the initial genotype frequency K0 of the locus, conditional on data, (ii) simulating allele frequency change for τ generations, based on \({\widehat{N}}_{e}\), (iii) simulating samples by sampling genotypes with genotype frequencies based on \({\widehat{F}}_{IS}\), and (iv) calculating \({F}_{ST}^{*}\) for the simulated sample. The proportion of \({F}_{ST}^{*}\) equal or larger than the observed \({F}_{ST}^{l}\) provides an estimate of the p-value for the test. In order to account for multiple tests across the genome, false discovery rates (q-values) are computed from the distribution of the p-value. Genotype counts observed in the simulated sample of n0 individuals are modelled as coming from a multinomial distribution Mult(n0,\({\gamma }_{0}\)), where \({\gamma }_{0}\) are the genotype frequencies in the population. In the subsequent generations, allele frequencies πt are simulated following a binomial distribution as \({\pi }_{t} \tilde B(2{\widehat{N}}_{e}, {\pi }_{t-1})\) where \({\widehat{N}}_{e}\)is the genome wide estimate of the effective population size and \({\pi }_{0}\) is determined by\({\gamma }_{0}\). At time\(t= \tau\), simulated genotype counts in samples are taken from the multidimensional distribution \({K}_{\tau }^{*} \tilde Mult({n}_{\tau }, {\gamma }_{\tau })\), were \({n}_{\tau }\)is the size of the sampled population at that time, and \({\gamma }_{\tau }\) are the genotype (AA, Aa and aa) frequencies in the populations as a function of the allele frequency \({\pi }_{t}\) and the multilocus inbreeding coefficient \({\widehat{F}}_{IS}\) the estimate from both temporal samples (Weir & Cockerham, 1984). Selfing rate is assumed constant across generations. The method was implemented using DriftTest script (Navascués & Vitalis, 2020).
The Osi group was excluded from search for selection footprint given the small number of accessions available (70) and given the important disequilibrium in the number of accessions in Collect-1 and (15) and Collect-2 (55).
Functional characterisation of loci under selection
Genes located within an interval of 250kb on both sides of independent loci detected to be under selection, were determined using the RAP-DB annotation of rice genome (https://rapdb.dna.affrc.go.jp/index.html). The corresponding MSU7.0 gene identity (TIGR) was retrieved using gene-ID converter tool proposed by RAP-DB. The window size of 250kb was chosen to represent approximately half the distance at which LD decayed to half its initial level, for both Og and Osj groups, estimated at 450kb on average across all chromosomes.
Enrichment analysis of genes within the chromosomic segments bearing loci under selection was performed using two gene ontology tools: Agri-GO (Tian et al., 2017; http://bioinfo.cau.edu.cn/agriGO/) and Panther-GO (Mi et al., 2019; http://GeneOntology.org), that use RAP and MSU gene identity, respectively, as input.
The chromosomic segments bearing loci under selection were also investigated for genes known to be involved in reproductive development processes, using available literature (Wei et al., 2020; http://www.modelcrop.org/). Likewise, chromosomic segment bearing loci under selection were investigated for enrichment in quantitative trait loci (QTL) involved in DTHD, using the Gramene QTL database (http://www.gramene.org/).
Analysis of the temporal evolution of agro-climatic parameters
Climatological data (rainfall and temperature) of the 1961–2010 period were extracted from the archives of Guinea National Weather Service for a weather station in tropical forest area (N’Zérekoré, 7°48’53.2”N, 8°42’14.11”W) and a second station in the Sudanian savannah (Kankan, 10°23’01.65”N, 9°18’18.72”W) of the study region. Evolution of the annual rainfall and of the annual crops’ growth season rainfall was characterised using the Lamb index (Lamb, 1982): \(I=({X}_{i}-X)/\sigma\), where \(I\) is the standardised anomaly (Lamb index), \({X}_{i}\) is the climatic variable (rainfall) for the year i (or a portion of the year i), \(X\) is the average of the same variable in a reference period considered as normal, and \(\sigma\) is the standard deviation of the variable in the reference period. The normal meteorological reference period considered was 1961–1990, as recommended by the World Meteorological Organisation (WMO, 1996). The period considered as “crops’ growth season” was from June 1st to the end of October. The same methods were used to characterise the evolution trends of the minimum and maximum temperatures.