Population structure, linkage disequilibrium and association mapping for cotton leaf curl disease incidence in cotton (Gossypium hirsutum L.)

Background: Cotton ( Gossypium hirsutum L.) production is adversely affected by many biotic and abiotic factors. Among the biotic factors, cotton leaf curl disease (CLCuD), caused by a begomovirus, is the most damaging. An association mapping approach was used to nd out population structure and linkage disequilibrium in cotton germplasm; and simple sequence repeat (SSR) markers associated with CLCuD infestation and seed cotton yield per plant (SCY). Results: Data was recorded for percent disease index (PDI) and SCY of 30 diverse cotton cultivars. Out of one thousand SSR markers used for polymorphism screening of plant material, 125 were found polymorphic and used for genotyping of 30 cultivars. STRUCTURE 2.0 software identied three sub-populations in plant material. Linkage disequilibrium (LD) analysis showed that at P < 0.05, 270/7750 marker pairs (3.48%) displayed signicant pair-wise LD and values for correlation between alleles at two loci ( r 2 ) and disequilibrium coecient ( D (cid:0) ) were > 0.05 and > 0.33, respectively, which were stringent enough for identication of valid marker-trait associations. Three markers (BNL1421, BNL946, NAU2336) were associated with PDI and ve markers (NAU3100, NAU3860, NAU2419, NAU3011, NAU4565) were associated with SCY under both GLM (general linear model) and MLM (mixed linear model) analyses. Conclusion: Marker BNL1421 (with phenotypic variance explained ( R 2 ) and heritability values of 26.45% and 0.90, respectively) would be a valuable candidate in marker-assisted selection (MAS) to screen CLCuD susceptible cotton genotypes. Findings of this study will help in understanding genetics of CLCuD infestation and designing future molecular breeding programmes to control this disease. virus,

by hybridization between an African A genome and an American D genome about 1-2 million years ago [8,9]. Diploid species are divided into eight genomic groups A-G and K [10]. Spinable ber is the product of A genome species [11]. World's cotton production is largely based on Gossypium hirsutum and Gossypium barbadense [12]. Gossypium hirsutum, also known as Upland and Mexican cotton, represents 90% of global ber production. It produces longer, stronger and ner ber [13]. Gossypium barbadense, also known as Pima cotton, involves in 8% global cotton production. It is valued for its higher quality ber [14]. Gossypium arboreum (Tree cotton) and Gossypium herbaceum (Levant cotton) contribute less than 2% of global cotton production [15].
World's agriculture production is adversely affected by diseases caused by viruses and unfavorable climatic conditions [16]. Cotton production is threatened by many biotic and abiotic stresses. Among biotic stresses, cotton leaf curl disease (CLCuD) is most damaging. CLCuD is considered as the major threat to cotton production in Pakistan [17]. CLCuD restricts the cotton production by 20%-30% annually in Pakistan which depends on the severity of the disease [18,19]. Millions of cotton bales have affected by this disease and country faced economic di culties. It is estimated that Pakistan economy suffered a loss of 75 million rupees during last ten years due to CLCuD [20]. CLCuD is transmitted by the whitefly Bemisia tabaci [21]. This disease is caused by a begomovirus, belonged to the family Geminiviridae. Geminiviruses infect both monocotyledonous and dicotyledonous plants. Cotton leaf curl virus has a wide range of alternative hosts including okra, tomato, potato, Chinese rose, tobacco, lehli and dhatura [22].
Geminiviruses are single stranded circular DNA viruses. The genome of these viruses consists of 2500-5200 bases. The family Geminiviridae is divided into nine genera on the basis of arrangement of genome and vector. Several insects are involved in the transmission of these viruses such as aphids, leafhoppers, treehoppers and white ies. White ies are involved in the transmission of begomoviruses [23]. The largest genus of family Geminiviridae is begomovirus. More than 320 species are present in this genus. Dicotyledonous plants are infected by begomoviruses [21]. Production of many crops including cotton is reduced on a large scale by begomoviruses in temperate, tropical and sub-tropical areas. All members of family Geminiviridae are monopartite except begomoviuses [23]. Begomoviruses are of two types including monopartite and bipartite begomoviruses. Most of begomoviruses genome contains two types of components. One is known as DNA A and other is known as DNA B. These begomoviruses are known as bipartite viruses. The size of both components of genome is same. Each component of genome is consisting of about 2600 nucleotides. Only DNA A is present in monopartite begomoviruses [21].
Monopartite begomoviruses are responsible for CLCuD [24,25]. The DNA A of bipartite genome contains 6 genes out of which V1 and V2 genes are present on sense strand while other 4 genes are present on antisense strand such as C1, C2, C3 and C4 genes. These genes encode different proteins. For example, replication-initiation protein (Rep) is encoded by C1 gene while C2 gene encodes transcriptional activator protein. Replication enhancer protein is encoded by C3 gene. C4 encodes protein that is necessary for movement of virus. This protein determines the severity of disease and range of host [26]. Two types of satellites are associated with begomoviruses. These satellites are known as alphasatellites and betasatellites [24]. The size of alphasatellites is about 1.4kb. Allphasatellites do not take part in the determination of disease symptoms [27]. The level of viral DNA in infected plants is suppressed by alphasatellites [28]. The replication of helper virus is enhanced by betasatellites. Betasatellites consist of about 1350 nucleotides. These are small and non-circular ssDNA satellites which are linked with majority of monopartite begomoviruses. The protein (βC1) which determines the symptoms of disease is also encoded by betasatellites. The severity of disease is determined by the betasatellites [24]. Disease severity is positively correlated with vrisue/satellite DNA thus more virus/satellite results into more disease severity [17].
CLCuD rst appeared in Nigeria in 1916. But in Pakistan, the cotton plants were rst attacked by this disease in 1967 at Tiba Sultan Pur near Multan district. Its rst epidemic in Pakistan appeared during 1992-93 [29]. CLCuD has speci c symptoms such as growth stunting, visibility of thickened veins, leaf darkening, and leaf enations may be one or more on the underside side of leaf. Enation is a leaf like small outgrowth of plant tissue that is caused by a virus infection [24]. CLCuD caused two types of thickening of veins on cotton plants; i) major vein's thickening ii) minor vein thickening. Leaf margins become thick at the initial stage of disease [30]. Seriously infected plant has shortened internodal length that results into stunted growth of the plant [24,31]. CLCuD adversely affects ber quality. It also reduces seed weight, number of monopodial and sympodial branches [32]. Its susceptible plants have darker color than normal plants due to the proliferation of chlorophyll deposited tissues [33].
Different strategies are used to compete with the problem of CLCuD. But the best option is to use the resistant cultivar. Molecular mapping approach can help to identify the genomic regions of cotton associated with CLCuD resistance. Association mapping is a recent molecular mapping approach which has great resolution. Association mapping (AM), also known as linkage disequilibrium (LD)-based mapping, is used to identify genomic regions associated with phenotypic variations [34]. This technique is used to determine marker-trait associations with the help of genome-wide distributed genes and markers and to study the architecture of the trait. This technique is appropriate for both synthetic and natural populations [35]. AM is a technique that is used to analyze a large number of polymorphisms to assess the QTL effect. It does not require generation of segregating population and a large number of progeny, due to which it is more preferable than linkage analysis (LA). Association mapping can increase mapping resolution. It decreases the research time and greater number of alleles is identi ed [36]. In the present project, an association mapping approach was used to identify simple sequence repeat (SSR) markers associated with CLCuD resistance in cotton.

Phenotypic variation
Great differences were observed among the cultivars for PDI and SCY as revealed from the ANOVA of combined data of two years ( Table 2) Out of total 7750 marker pair combinations, 270 (3.48%), 86 (1.11%), and 29 (0.37%) marker pairs exhibited LD at P < 0.05, P < 0.01, and P < 0.001, respectively. List of marker pairs which exhibited higher LD is given in Table 3. A non-random linkage between loci located on different chromosomes was observed. At P < 0.05, values for correlation between alleles at two loci (r 2) and disequilibrium coe cient (D ) were >0.05 and >0.33, respectively. These values were stringent enough for identi cation of valid marker-trait associations. Therefore, analysis for marker-trait associations assessment was performed.

Discussion
CLCuD susceptibility is a quantitative trati CLCuD is the major cause of huge loss to world cotton production, especially in Pakistan, India, and China [37]. In Pakistan, there is considerable reduction in cotton crop area due to non-availability of CLCuD resistant elite cotton cultivars and cotton farmers were forced to cultivate alternative crops for sustainable earnings from their agricultural lands [3]. Conventional breeding programmes in different countries were successful to develop CLCuD resistant cotton varieties, but all these varieties were with very low SCY potential [38]. These cotton varieties were out of farmers' production scheme within few years after their approval as these varieties could not meet farmers' economic concerns. For sustainable cotton production, it is direly needed that cotton cultivars having CLCuD resistance with higher SCY potential be developed. For combining CLCuD resistance and high SCY potential in a cotton variety is a challenging task. This needs a thorough knowledge and information about genetic potential of cotton genotype for CLCuD resistance as well as its genetic architecture for SCY potential.
Molecular mapping approaches help to tag genomic regions associated with traits of interest [34,36,39,40]. Identi ed markers through these mapping approaches can be incorporated in to markerassisted selection (MAS) programmes to pyramid desirable alleles in a single crop cultivar. In this study, CLCuD susceptibility loci were found located on chromosomes 13 (BNL1421), 14 (NAU2336), and 20 (BNL946). This study showed that CLCuD susceptibility is not a simple Mendelian character controlled by one gene. More than one loci are involved in CLCuD susceptibility. These loci are scattered on different chromosomes. Both A and D genomes contained CLCuD susceptibility loci.

Marker-assisted selection for developing CLCuD tolerant genotypes
In the past, there was paucity of DNA markers associated with CLCuD susceptibility/tolerance. Being a complex genetic character, development of CLCuD tolerant genotypes through conventional breeding approach is not a feasible option as it is time consuming and selection of elite genotypes is cumbersome through this approach. Marker-assisted selection (MAS) and subsequent genomics-assisted breeding is the way forward. Identi ed markers associated with CLCuD susceptibility, especially marker BNL1421, which had higher heritability value of 0.90, would be suitable marker to screen out CLCuD susceptible cotton genotypes. This locus would be further characterized to understand genetics of CLCuD susceptibility.

Conclusions
CLCuD is affecting cotton production in many countries of the world. This study showed that cotton genotypes had considerable differences and diversity with regard to CLCuD tolerance. MS-39, cotton cultivar showing highest tolerance to CLCuD, needs to be genetically characterized to understand CLCuD tolerance mechanism in minute details. Similarly, genetic characterization of MNH-636, cotton cultivar with highest susceptibility to CLCuD, would also be helpful in designing genomics-assisted breeding (GAB) strategy to develop elite cotton cultivars with CLCuD tolerance. Fine mapping of identi ed markers associated with PDI would lead to identi cation of putative genes/alleles involved in CLCuD susceptibility/tolerance. Findings of this study showed that more than one loci were involved in CLCuD susceptibility and GAB is the most suitable strategy to develop elite CLCuD tolerant cotton genotypes with sustained SCY potential and conventional breeding efforts are inadequate to meet this challenge.

Plant material
Plant material consisted of thirty cultivars of cotton (Gossypium hirsutum L.). List of these cultivars, including their original sources, is given in Table 1. All these were commercial cultivars originated from different cotton research centers of Pakistan and USA. These cotton cultivars were part of cotton germplasm maintained at Cotton Research Station, Ayub Agricultural Research Institute, Faisalabad, Pakistan (CRS-AARI, FSD) (https://aari.punjab.gov.pk/crs_faisalabad). CRS-AARI, FSD routinely acquires cotton germplasm from national and international cotton research institutes to ful ll its objective of "expansion of gene pool" and maintains this cotton germplasm in a very systematic and planned manner with complete record of source and origin. It is fully authorized and had license for this cotton germplasm collection and maintenance. Field experimentation of this study was conducted at CRS-AARI, FSD. Percent disease index was calculated by following formula: At maturity, the data for seed cotton yield per plant (SCY) was recorded. SCY was measured in grams (g) with laboratory measuring balance.

SSR genotyping
Plant material was screened for polymorphism using 1000 simple sequence repeat (SSR) markers. Ninetytwo SSR markers were found polymorphic. Genomic DNA was extracted by using 4-5 young, fully expanded leaves. Leaves were stored at -80 ºC. DNA was extracted by using DNA extraction Aidlab DNA mini kit (Cat #: DNA15, Lot: 261525AX). The DNA integrity was checked by electrophoresis on 1% agarose gel and quanti ed by Nanodrop 2000. The DNA ampli cation was performed by using standard PCR procedures described by Zhang et al. (2000) [42]. The DNA ampli cation of thirty commercial cultivars was performed by SuperCycler thermal cycler (Model: SC300G, Kyratec limited, Australia). This ampli cation was carried out with the help of Bioron Taq Master Mix, Bioron DFS-Taq DNA Polymerase, Bioron set of 4 dNTPs and Bioron 100 bp plus DNA ladder ready-to-use. Ninety-two polymorphic primers were used for ampli cation. DNA bands of ampli cation product were developed with silver staining and recorded with SX-image system. SSR markers were scored on their base pair sizes. Missing data was indicated by "-9" and "?" according to the software requirements.

STRUCTRE analysis
The pattern of population/genetic structure in thirty cultivars was assessed by using software package STRUCTURE [43]. It is a model-based clustering method.

Analysis of marker-trait associations
Marker-trait associations were assessed by GLM (general linear model) association test incorporating genotypic and phenotypic data; and Q (structure relatedness) matrices into TASSEL 2.1 software [44]. TASSEL stands for Trait Analysis for aSSociation, Evolution and Linkage.  Chr chromosome; D disequilibrium coefficient; LD linkage disequilibrium; pDiseq probability value for LD; r 2 correlation between alleles at two loci