Endemism in Recently Diverged Angiosperms Is Associated With Chromosome Set Duplication

Chromosome set duplication (polyploidy) drives instantaneous speciation and shifts in ecology for angiosperms, and is frequently observed in neo-endemic species. However, the extent to which chromosome set duplication is associated with endemism throughout the flowering plants has not been determined. We hypothesised that across the angiosperms polyploidy is more frequent and more pronounced (higher evident ploidy levels) for recent endemics. Data on chromosome counts, molecular dating and distribution for 4210 species belonging to the major clades of angiosperms were mined from literature-based databases. As all clades include diploid taxa, with polyploids representing a possible ‘upper limit’ to the number of chromosomes over evolutionary time, upper boundary regression was used to investigate the relationship between the number of chromosomes and time since taxon divergence, both across clades and separately for families, with endemic and non-endemic species compared. A significant negative exponential relationship between the number of chromosomes and taxon age was evident across angiosperm clades (R²adj = 0.48 with endemics and non-endemics considered together; R²adj = 0.46 for endemics; R²adj = 0.44 for non-endemics; p ≤ 0.0001 in all cases), which was three times stronger for endemics (decay constant = 0.12, cf. 0.04 for non-endemics). The majority of families exhibited this relationship, with a steeper regression slope for endemic Campanulaceae, Compositae, Leguminosae, Poaceae, Caryophyllaceae and Rosaceae, cf. non-endemics. In conclusion, chromosome duplication is more frequent and extensive in recent angiosperms, particularly young endemics, supporting the hypothesis of polyploidy as a key driver of neo-endemism. However, as young endemics may also be diploid, polyploidy is not an exclusive driver of neo-endemism.


Introduction
Why endemic species occur spontaneously in restricted geographic ranges is a key question in biogeography and ecology with implications for the conservation of rare and nationally important species (Olivieri et al. 2015). For many endemics, reasons for range restriction are evident. For example, they may be limited to a small area because they can occupy a limited ecological niche (Williams et al. 2009), or have originated by hybridization and occur only in the area of contact between the progenitor species (Grünig et al. 2021), they may be relict species remaining from a much wider past range (paleo-endemics; Petrova et al. 2015) or are recent species (neo-endemics, sensu Stebbins and Major 1965) yet to disperse beyond the centre of origin (Behroozian et al. 2020). An additional explanation may lie with genome duplication (polyploidy), which effectively represents an 'instantaneous', sympatric speciation mechanism (Mayr 1963), and can be a major driver of plant evolution (Levin 1983; Otto and Whitton 2000; Soltis et al. 2014; Van de Peer 2017). Polyploidy occurs when the cell division cycle includes chromosome duplication but lacks the subsequent stages of chromosome and cell component separation, resulting in cells with multiple copies of chromosomes. This can occur during mitosis (in a process known as somatic doubling) or meiosis (non-reduction during sporogenesis), either within populations of a single species (autoploidy), or subsequent to interspecific hybridization (allopolyploidy) (Ramsey and Schemske 1998). Multiple copies of chromosomes require larger nuclei and cells, often altering the physiology and morphological traits of the offspring with respect to parental species: indeed, these events are usually linked with key innovations in phenotype that affect fitness (for example, polyploids exhibit a greater tendency to use vegetative reproduction; Herben et al. 2017; see also Soltis and Soltis 2016).
Altered phenotypes in polyploids are known to change the ecology of these species with respect to diploid progenitors, for example larger cell size and larger organs (the 'giga effect') produce larger flowers of different colours and scents that favour different pollinator species (see examples in Rezende et al. 2020). Exhibiting different sets of chromosomes, polyploids are often reproductively isolated from mother plants (Husband and Sabara 2004; Laport et al. 2016; Lavania 2020). Ancient events of genome doubling are thus associated with increased rates of speciation (Husband et al. 2013; Soltis and Soltis 2016). Polyploidy occurs frequently in plants, and particularly in angiosperms (Soltis and Soltis 2016; Lavania 2013, 2020): it has been determined that all angiosperms show evidence of multiple events of genome doubling, except for basal Amborella trichopoda Baill. (Soltis et al. 2009; Jiao et al. 2011; Lavania 2020). Polyploid neo-endemics have been shown to exhibit a particularly wide range of functional traits and capacity to adapt, with geographic range constrained simply by limited time for dispersal (Behroozian et al. 2020).
However, despite being widely recognised as an important process for plant evolution and ecology, the extent to which chromosome set duplication represents a general mechanism in the emergence of endemic species throughout the angiosperms has yet to be quantified.
Understanding the extent to which chromosome set duplication is associated with endemism is complicated by the fact that endemism is defined, on a case-by-case basis, in terms of geographical regions or political boundaries, which are extremely variable and often biologically irrelevant. Furthermore, the distinction between 'neo-endemic' and 'paleo-endemic' may be based on perceived past distributions inferred from current distributions, rather than objective measures of events over time. Additionally, it can also be difficult to define the range of an endemic species, which can be relatively extensive (i.e. "continental endemic") or comparatively restricted (i.e. "local endemic") (Coelho et al. 2020). Moreover, geographical range determined from occurrence records and species distribution models does not always consider local compliance with the ecological requirements of the species, the degree of habitat preservation or the presence of highly anthropized areas, and thus should be interpreted carefully (Broakes et al. 2010; Beck et al. 2014; Fithian et al. 2014). The highest richness of endemics is found in biodiversity hotspots (Cañadas et al. 2014; Noroozi et al. 2018), which have been identified in 36 areas around the globe (Conservation International, www.conservation.org; Critical Ecosystem Partnership Fund, www.cepf.net), and range from 18,972 km² (New Caledonia) to 2,373,057 km² (Indo-Burma region). Such heterogeneity often requires the identification of smaller, higher-concentration areas within these regions ("hotspots within hotspots") to effectively target conservation efforts (Cañadas et al. 2014; Noroozi et al. 2018).
A further complication may arise from the fact that ploidy level is not always easy to define, even when the basic number of chromosomes is known for a taxon. Indeed, this can be the result of ancient autopolyploidy or allopolyploidy (Parisod et al. 2010; Lavania 2013; Zozomová-Lihová et al. 2014), sometimes followed by diploidization events and thus chromosome loss (i.e. diploidized paleopolyploids; Tamayo-Ordóñez et al. 2016; Qiao et al. 2019). Moreover, some taxa show intraspecific variability, with multiple chromosome counts and different ploidy levels, which is expected when polyploidy events occur relatively frequently (Husband et al. 2013). While polyploidy multiplies sets of chromosomes (and thus has striking effects during karyotype evolution), a range of processes can subtly rearrange single chromosomes, altering chromosome number or characteristics such as size and DNA content. These include insertion, deletion or duplication, inversion, and intra- or inter-chromosomal reciprocal translocation, which can be particularly evident for paleo-species that have accumulated changes over time (reviewed by Schubert and Lysak 2011). Essentially, change in the overall set of chromosomes occurs due to polyploidy, further mediated by complicating processes that can both increase and decrease the number of chromosomes; the relationship between the number of chromosomes and polyploidy is not necessarily straightforward. With this caveat in mind, in the present study it is assumed that entire genome multiplication in the sporophyte generation (2n) is principally affected by polyploidy, and the number of chromosomes is used as a quantitative measure to represent the net result of karyotype evolution (see Methods for details).
Despite the considerable attention given to the classification of endemic species in terms of when and how they originated (Favarger and Contandriopoulos 1961; Stebbins and Major 1965; Maers and Giller 2013), an age threshold differentiating paleo- from neo-endemics remains undefined. Indeed, the term paleo-endemic refers more to a process than to a particular time or period per se (i.e. endemism by restriction or fragmentation of a previously extensive range). However, when comparing speciation events and the evolution of taxa it may also be useful to refer to the absolute framework of geological time and the climatic and geographic context that this provides. While species younger than 1 million years are clearly neo-endemics (Kraft et al. 2010), the issue becomes complex for less recent species. Even without providing a precise criterion, numerous attempts have been made to identify a clear discriminating principle. Ferreira and Boldrini (2011), for example, addressed the problem by suggesting the combination of a dated phylogeny (i.e. estimated age and degree of systematic isolation) with environmental context (based on stratigraphy and the age of underlying rocks). Lazarina et al. (2019) considered reproductive and geographic isolation, while Mishler et al. (2014) proposed a Categorical Analysis of Neo- And Paleo-Endemism (CANAPE) based on the Relative Phylogenetic Endemism index, to distinguish centres of neo- and paleo-endemism. However, these methods are too unwieldy to be used for a prompt distinction between neo- and paleo-endemics in large datasets, and an age threshold fixed a priori, despite being relatively imprecise, provides a consistent and readily applicable approach. Among the recently formed endemics, particular relevance has been given to apo-endemics, or polyploids diverged from diploid progenitors (Stebbins and Major 1965).
However, the origin of a polyploid and divergence in the case of sympatric speciation is not always possible to date (Doyle and Egan 2010). Indeed, dating polyploidy events and their role in creating new taxa has so far been limited to the timing of major clade emergences (Wood et al. 2009), lacking sufficient detail to compare particular species within families or genera. In the present study, rather than entering into the debate regarding what constitutes a neo- as opposed to a paleo-endemic, we refer simply to the time period elapsed since the divergence of the taxon.
In summary, the comparison of estimated taxon age (Ma since divergence) with the number of chromosomes will allow investigation of whether the chromosome complement is higher in recent species than in ancient ones; information on the occurrence range of species will allow assessment of whether the phenomenon is general within the angiosperms or relatively prevalent in endemics. Based on these data, the principal objective of the present study is to assess whether polyploidization events are principal drivers of the emergence of new endemic species. Specifically, it is hypothesized that, despite a prevalence of diploid taxa throughout evolutionary time: (1) the highest chromosome counts are evident for angiosperm taxa that have diverged recently, (2) higher numbers of chromosomes are particularly evident for recent endemic (cf. non-endemic) taxa, and (3) the character of the ploidy level/divergence time relationship is consistent throughout the angiosperms, from ancient to recently diverged clades.

Data mining
Investigating the relationship between genome duplication and neo-endemism for angiosperms involved analysis of sporophyte chromosome count data, with neo-endemism (reflecting both restricted geographic distribution and taxon age) being represented by 'time since divergence' and geographic presence/absence data. These data were mined from databases containing values from the scientific literature, and directly from the literature itself, and aimed to broadly represent both endemic and non-endemic taxa across the angiosperms. Thus data mining was performed for major clades of angiosperms following the recent phylogeny of Leebens-Mack et al. (2019), starting with ANA-grade taxa (represented by Nymphaeaceae; other families in this clade are too under-represented in terms of both chromosome number and taxon age data), and following the phylogeny via the monocots (represented by Poaceae and Orchidaceae), Magnoliids (Magnoliaceae), Ranunculids (Ranunculaceae), Caryophyllales (Caryophyllaceae), Asterids (Apiaceae, Campanulaceae, Compositae, Ericaceae, Primulaceae), Saxifragales (Saxifragaceae) and Core Rosids (Brassicaceae, Euphorbiaceae, Leguminosae, Rosaceae, Violaceae). Taxonomic name standardization was indirectly ensured using data from The Chromosome Counts Database (CCDB, v 1.47: ccdb.tau.ac.il/browse), which is based on the automatic taxonomic name resolution software Taxonome (Kluyver and Osborne 2013) and the consensus database The Plant List (v 1.1: www.plantlist.org). Species with both "accepted" and "unresolved" taxonomic status were included in the analysis; subspecies and varieties were discarded.
The diploid number of chromosomes for the sporophyte generation of each species was obtained from the CCDB (last access: October 2020), as a quantitative proxy of ploidy level. Only one count per species was included in the analysis, except when different counts were equally reported in the database. When multiple counts were reported for a species, the modal value was retained; when a species exhibited multiple modal values, all of these were retained (e.g. 2n=25, 2n=30, 2n=36 for Paphiopedilum victoria-mariae; Orchidaceae). Missing sporophyte values were obtained by doubling the gametophyte counts. B chromosomes were not considered. Negative values, as well as 0 and 1, were considered errors (being biologically improbable) and discarded. When possible, the source material used for the counting was checked: usually the mitotic counts were made using the root-tip squash method (Miller 1961), while meiotic counts were made from floral buds (reviewed by Windham et al. 2020). Chromosome counts for each taxon are presented in Table S1 (Online Resource 1).
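The count-selection rules described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names (`clean_counts`, `modal_counts`, `sporophyte_from_gametophyte`) are hypothetical, and the input is assumed to be a plain list of reported 2n counts for one species.

```python
from collections import Counter


def clean_counts(counts):
    """Discard values treated as errors in the text: negatives, 0 and 1."""
    return [c for c in counts if c > 1]


def modal_counts(counts):
    """Return every modal 2n value reported for a species, in ascending order.

    If several values are reported equally often, all of them are kept,
    mirroring the rule that multiple modal values are all retained.
    """
    valid = clean_counts(counts)
    if not valid:
        return []
    freq = Counter(valid)
    top = max(freq.values())
    return sorted(value for value, n in freq.items() if n == top)


def sporophyte_from_gametophyte(n):
    """Fill a missing sporophyte (2n) value by doubling the gametophyte count."""
    return 2 * n
```

For instance, `modal_counts([25, 30, 36])` keeps all three values, matching the Paphiopedilum victoria-mariae example, whereas `modal_counts([22, 22, 44])` keeps only 22.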
The age for each taxon was obtained from the public database Timetree: The Timescale of Life (TTOF, www.timetree.org; last access: December 2020). Molecular dating has been applied to an increasing number of species (the largest dated phylogenetic tree for the angiosperms comprises more than 36,000 species, belonging to ~8,400 genera, 426 families and all orders; Janssens et al. 2020), but heterogeneity in datasets, sequences, calibrations and the software used can yield different estimates for the same species, often hindering comparison between the results of different studies (Pulquério and Nichols 2007). TTOF provides a comprehensive synthesis of data published between 1987 and 2013 (2,274 studies) for 50,632 species, of which 14,465 are angiosperms, and offers data uniformity, rapid data access and a robust foundation in the scientific literature (Hedges et al. 2006, 2015; Kumar and Hedges 2011; Kumar et al. 2017). Divergence time between taxa is estimated through a hierarchical average linkage method; maximum consistency with each original time-tree is ensured by testing and updating topological partitions (Hedges et al. 2015). A preliminary check of data included in the TTOF estimates was made from original chronograms in specific published papers, cited by TTOF. The discretionary value of 0.001 Ma was attributed to extremely recent nodes, when a specific "estimated time" was not indicated (e.g. Adenocarpus hispanicus (Leguminosae), Anemone hepatica, A. narcissiflora (Ranunculaceae), Magnolia coco, M. obovata and M. officinalis (Magnoliaceae); see Table S1 for estimated divergence times for all study species; Online Resource 1).
Here, we classify species as endemic on a case-by-case basis using a combination of quantitative data (geographic range limited to below a set threshold) guided by qualitative information such as designation as 'endemic' in national and regional floras. The geographical range for each species was obtained from public databases ( Soroseris, Stebbinsia and Syncalathium). Species with a geographical range not exceeding 600,000 km² were classified as endemic. This threshold was chosen in order to include the remaining vegetation of the 36 Biodiversity Hotspots (see Table S2, Online Resource 2) according to Conservation International. Since the extension of some hotspots (i.e. Indo-Burma region, Brazil's Cerrado, or Mediterranean Basin) exceeds 2,000,000 km², and also includes urban areas, only the extension of the remaining vegetation was considered. Classification as endemic or not was also checked by determining whether species are recognised as endemic in national or local floras (for example: New Zealand Plant Conservation Network, http://www.nzpcn.org.nz; Cellinese et al. 2009 for endemic Campanulaceae of Crete; Brochmann et al. 1997 for endemics of Cape Verde).
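As an illustration only, the range threshold can be written as a small Python sketch. The names `ENDEMIC_MAX_RANGE_KM2`, `is_endemic_by_range` and `needs_manual_check` are hypothetical, and how disagreements between the range threshold and flora designations were resolved is not specified in the text; here they are simply flagged for case-by-case review.

```python
# Threshold stated in the text: ranges not exceeding 600,000 km2 count as endemic.
ENDEMIC_MAX_RANGE_KM2 = 600_000


def is_endemic_by_range(range_km2):
    """Quantitative criterion: geographical range not exceeding the threshold."""
    return range_km2 <= ENDEMIC_MAX_RANGE_KM2


def needs_manual_check(range_km2, listed_as_endemic_in_flora):
    """Flag species where the range criterion and the flora designation disagree.

    Assumption of this sketch: such cases are set aside for case-by-case review.
    """
    return is_endemic_by_range(range_km2) != listed_as_endemic_in_flora
```

A species with a range of exactly 600,000 km² is classified as endemic under this reading of "not exceeding".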

Data analysis
The coverage rate of the available data for each family was calculated as a percentage ratio between the number of species included in the analysis and the total number of species (both accepted and unresolved) reported in the consensus database The Plant List. Three separate analyses were performed both on the totality of data collected (referred to as "angiosperms") and on subsets for single families, further sub-divided into analysis of all species (endemics and non-endemics), and endemics and non-endemics treated separately. For Nymphaeaceae only one analysis was performed, due to lack of data on endemic species, while for Magnoliaceae, Rosaceae, Saxifragaceae and Violaceae analysis of endemic species was not performed, due to insufficient data (10 spp. or less). Analyses were performed using the statistical software R (version 3.5.1; R Core Team 2018). Data were plotted according to the estimated time since divergence (x axis) and the number of chromosomes (y axis), using the ggplot2 package (Wickham et al. 2020).
In order to investigate the maximum number of chromosomes exhibited by taxa over geological time (time since taxon emergence), an upper boundary regression was applied to the data, fitting the regression curve only to the highest values of the data set. Boundary functions are widely used in ecological studies to highlight the maximum effect of a process, otherwise concealed by the weight of the mean values (Pierce 2014 and references therein). To remove the effects of redundant chromosome counts within each family, age values along the x axis were divided into periods ('bins') of 1 million years, and regression was fitted to the five highest y values within each bin. The function applied was an exponential decay, with the formula: $y = Ae^{-kx} - c$, where: y = sporophyte number of chromosomes; A = initial quantity (estimated y value for x = 0); k = decay constant; x = estimated time since divergence; c = lowest y value for each family. Exponential decay was chosen for modelling the relationship between number of chromosomes and age, since polyploids originate from progressive duplications of the genome (Brysting et al. 2007). The c parameter was introduced in order to obtain a horizontal asymptote equal to the lower chromosome count and avoid curves tending to zero, as zero chromosomes would be biologically unrealistic.
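The binning step that precedes the upper boundary fit can be sketched as follows. This is a Python illustration rather than the authors' R code; the function name `upper_boundary_points` and its parameters are hypothetical, and the handling of ties and of ages falling exactly on a bin edge is an assumption. The retained points would then be passed to a nonlinear least-squares routine to estimate A and k.

```python
import math
from collections import defaultdict


def upper_boundary_points(ages, counts, bin_width=1.0, top_n=5):
    """Keep the top_n highest chromosome counts within each age bin.

    ages   -- estimated times since divergence (Ma)
    counts -- sporophyte chromosome numbers (2n), same order as ages
    Returns a list of (age, count) pairs, ordered by bin and then by
    descending count within each bin.
    """
    bins = defaultdict(list)
    for age, count in zip(ages, counts):
        # Assumption: bins are [0, 1), [1, 2), ... for a 1 Ma bin width.
        bins[math.floor(age / bin_width)].append((age, count))

    selected = []
    for key in sorted(bins):
        highest = sorted(bins[key], key=lambda point: point[1], reverse=True)
        selected.extend(highest[:top_n])
    return selected
```

With six records in the first 1 Ma bin, only the five highest counts are kept, so a large number of recent diploids does not pull the fitted curve downward.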
Finally, the percentage ratio between the number of polyploids (sensu Wood et al. 2009) and the total number of species for each family included in the analysis was calculated in order to test whether taxonomic groups are differentially predisposed to polyploidy, in terms of both formation and establishment. All data are available in a Microsoft Excel spreadsheet format (Table S1; Online Resource 1).

Results
Analyses were performed on a total number of 4530 records, corresponding to 4210 species, 1270 (30.2%) of which were found to be endemic. For each family, the coverage rate of collected data generally did not exceed 4% of taxa recorded for the family (see Table S3, Online Resource 3), with the exception of Ranunculaceae (5.8%), Apiaceae (6.2%), Magnoliaceae (9.1%) and Campanulaceae (12.6%). According to the estimated crown age (Table S3,  Ma) and Apiaceae (54 Ma, 95% C.I.: 29-57 Ma) are the most recent. However, uncertainty related to the estimated crown age was sometimes substantial: in Rosaceae, the extreme case, the confidence interval differed by 120 million years. Leguminosae (71-80 Ma), Violaceae (67-77 Ma) and Ranunculaceae (75-88 Ma) exhibited the least variable estimates. The Saxifragaceae family showed the greatest difference between the oldest species and the estimated crown age for the entire family (65.3 million years), while the lowest difference (20 million years) occurred in Leguminosae. Indeed, the oldest species included in the analysis (56 Ma) belong to the genus Sophora: S. affinis, S. flavescens, S. microphylla and S. secundiflora. Most data fell in the last 10 million years, with the exception of Magnoliaceae and Nymphaeaceae, which showed spikes in species divergence between 28.3 and 30 Ma, and at 11.3 Ma, respectively. In some families (Euphorbiaceae, Leguminosae, Orchidaceae, Poaceae, Primulaceae), available data for endemic species are relatively recent, compared to non-endemic species. However, only eight endemic species (Helleborus orientalis, Lobelia physaloides, Magnolia guatemalensis, M. kobus, M. minor, M. salicifolia, Malus tschonoski, Pimpinella siifolia), or approx. 0.6% of the total number, are older than 20 Ma.
Recurrent numbers of chromosomes were evident in all families, which sometimes represent high percentages of data, for example 2n = 22 in Apiaceae (58%), 2n = 34 in Campanulaceae (38.8%), 2n = 38 in Magnoliaceae (67.7%), 2n = 32 in Ranunculaceae (54.9%) and 2n = 48 in Violaceae (39%). Model parameters (i.e. k = decay constant; A = number of chromosomes estimated for age = 0 Ma) and statistics (i.e. R²adj and p-value) were extracted from model elaboration in the R environment. Hereafter, the adjusted R² values for each general analysis (total species (T) representing endemics (E) plus non-endemics (N)) are indicated with R²adjT, while for each analysis performed on endemics and non-endemics separately variance is indicated with R²adjE and R²adjN, respectively.
Upper boundary regression applied to all of the angiosperms in the dataset (Fig 1) showed a significant exponential decay trend (R²adjT = 0.48; R²adjE = 0.46; R²adjN = 0.44; p always lower than 0.0001) between the number of chromosomes and the estimated time since divergence, which was three times steeper in endemic species (k = 0.12) with respect to non-endemics (k = 0.04). In this analysis, half-life for the decline in maximum number of chromosomes over time was 12.9, 6.2 and 19.8 Ma for angiosperms in general, and for endemic and non-endemics, respectively.
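For orientation, the half-life of a pure exponential decay term follows from the decay constant as ln(2)/k. The short Python sketch below (the function name `half_life` is hypothetical) applies this to the rounded decay constants quoted above, giving roughly 5.8 Ma for k = 0.12 and 17.3 Ma for k = 0.04. These are somewhat shorter than the reported 6.2 and 19.8 Ma; the difference may reflect rounding of k or a half-life derived from the full fitted curve, which is an assumption of this sketch rather than something stated in the analysis.

```python
import math


def half_life(k):
    """Half-life (in the units of x, here Ma) of the term A * exp(-k * x).

    Only meaningful for a decaying curve, i.e. a positive decay constant.
    """
    if k <= 0:
        raise ValueError("half-life is only defined for k > 0")
    return math.log(2) / k


# Rounded decay constants quoted in the text.
for label, k in (("endemics", 0.12), ("non-endemics", 0.04)):
    print(f"{label}: k = {k}, half-life = {half_life(k):.1f} Ma")
```

Negative k values, as reported for some families, correspond to an increasing trend, for which a half-life is not defined.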
In analyses performed on single families, results were varied, but it was possible to distinguish common patterns. The overall negative relationship between the number of chromosomes and taxon age was confirmed in the majority of the families analysed, with differing degrees of significance. In Campanulaceae (Fig 2 a-c), Compositae (Fig 2 d-f), Leguminosae (Fig 2 g-i), Poaceae (Fig 2 j-l), Caryophyllaceae (Fig S1 a-c, Online Resource 4) and Rosaceae (Fig S1 d-f, Online Resource 4), the relationship was significant (e.g. for Campanulaceae: R²adjT = 0.42; R²adjE = 0.22; R²adjN = 0.32; for Compositae: R²adjT = 0.46; R²adjE = 0.44; R²adjN = 0.24; with p always lower than 0.0001), but similar exponential decay patterns were only evident as non-significant trends for endemic Ranunculaceae (Fig S1 g-i, Online Resource 4; R²adjE with a negative value and p = 0.376) and Primulaceae (Fig S1 j-l, Online Resource 4; R²adjE = 0.065; p = 0.116). The relationship was stronger for endemics than for non-endemics or the entire family, as confirmed by the higher values of k (e.g. Compositae, k = 0.15 and 0.04 for endemics vs. non-endemics, respectively; Caryophyllaceae, k = 0.21 and 0.15 for endemics vs. non-endemics, respectively; for the entire family, k = 0.11 for Compositae and k = 0.15 for Caryophyllaceae). The negative trend was generally less significant for Apiaceae (R²adjT = 0.47; R²adjE = 0.04; R²adjN = 0.39; p = 0.07 for endemics, and lower than 0.0001 for the entire family and for non-endemics; Fig S2 a For Ericaceae, regressions were not significant (R²adj always lower than 0.01; p always higher than 0.1) and trends suggested variable and contrasting results (Fig 3): a non-significant exponential decay trend for chromosome counts with increasing taxon age was suggested for endemics (k = 0.05). When non-endemics were taken into account, the regression exhibited a positive slope (k = -0.002 at the family level, and k = -0.01 for non-endemics).
In contrast, Brassicaceae (Fig 4) and Euphorbiaceae (Fig S3, Online Resource 4) exhibited statistically significant negative slopes in non-endemics and the family as a whole, while slopes for endemics were not significant (negative values for R²adj and p ~ 0.7) with increasing tendencies with taxon age (k = -0.03 and k = -0.009, respectively).
Finally, Nymphaeaceae (Fig 5) showed an overall, significant positive relationship, with k = -0.04, R²adjT = 0.29 and p = 0.03. A similar, but non-significant, tendency was shown in Saxifragaceae (k = -0.02 for the family and k = -0.01 for non-endemics, Fig S4 a-c, Online Resource 4), with negative values for R²adj and p always higher than 0.3.
The proportion of polyploid taxa within each family was found to differ substantially between families (Fig S5, Online Resource 4): polyploidy was evident for the majority of taxa in Violaceae (92%), Primulaceae (76%) and Campanulaceae (68%), while it occurred in less than 10% of species of the Leguminosae (8%) and Orchidaceae (6%).

Discussion
The results broadly support the hypotheses that polyploidy is particularly evident in recently diverged angiosperms (hypothesis 1), especially for recent endemic species (hypothesis 2). While this phenomenon was evident for most of the families investigated, it was not always observed, and the hypothesis of a mechanism working consistently across the angiosperms (hypothesis 3) was only partially supported. The distribution of chromosome counts with time (i.e. towards recent ages) highlighted the pattern of progressive multiplication of the chromosome set, with high concentrations of records corresponding to diploid, tetraploid and hexaploid counts (2n=2x, 2n=4x and 2n=6x, respectively): this is particularly evident in Apiaceae, Caryophyllaceae, Ranunculaceae and Rosaceae. Thus, polyploidization appears to be an important mechanism for the emergence of new species on a general level.
As confirmed by the steeper decay for the regression curves of endemic species, the relationship between high number of chromosomes and recent origin is especially strong for species with a limited distribution, and in particular Campanulaceae, Compositae, Leguminosae, Poaceae and Caryophyllaceae, which represent some of the largest and ecologically most important families. However, the considerable number of recent diploids, which necessitated the use of upper boundary regression to highlight the effects of polyploid counts, demonstrates that polyploidy is not an exclusive driver of the emergence of new species, either for endemics or non-endemics, and suggests that additional mechanisms are involved (e.g. geographic isolation, reproductive isolation, hybridization). This could also explain the weaker regression and the gentler slope for endemic, compared to non-endemic, Apiaceae and the general analysis.
However, the contrasting results for different families are evidence that phylogenetic effects operating within each family (revealed by the direction, range and variability of patterns) affect the tendency towards polyploidy. This is also supported by the different tendency for polyploidy evident for the families included in the analysis (Fig S5, Online Resource 4). In Ericaceae the regression analyses were not significant: it is likely that other prevailing mechanisms drive the emergence of new ericaceous species, for both endemics and non-endemics. Despite showing an overall negative tendency between the number of chromosomes and taxon age, a similar interpretation could explain the weak significance for Orchidaceae, also revealed by the highly variable 95% confidence intervals.
In Saxifragaceae, Magnoliaceae, Nymphaeaceae, Violaceae and perhaps Ericaceae, results were most likely affected by the limited availability of data. This could also explain the lower significance of the analyses for endemics and non-endemics with respect to the general family-level analysis in Primulaceae, and the atypical trend in endemic Euphorbiaceae and Brassicaceae, characterised by wide and irregular 95% confidence intervals. In particular, the estimated age for endemic Brassicaceae did not exceed 9 Ma, with only seven species older than 3 Ma (i.e. Cochlearia aragonensis, Draba hederifolia, Streptanthus glandulosus, Vella asperum, V. bourgaeana, V. pseudocytisus, V. spinosa). In these situations, the nature of the data requires careful consideration: in this example, four out of seven species belonging to the same genus, Vella, are endemics of Spain and were dated relying essentially on a single study (i.e. Simón-Porcar et al. 2015), and thus on a single dating approach. The independence of observations could not be assured, since V. asperum, V. bourgaeana and V. pseudocytisus are closely related (Simón-Porcar et al. 2015; see also Siljak-Yakovlev and Peruzzi 2012). This is likely to have effects on species distribution: sharing the same habit and certain morphological traits, the three species were also found in the same habitat (xerophytic shrublands on gypsum substrate, characterised by anthropic disturbance) (Gómez Campo 1993; Simón-Porcar et al. 2015). With regard to karyotypes, the three species share the same basic chromosome number (x=17), but while V. bourgaeana is diploid (2n=2x=34), V. pseudocytisus is mainly tetraploid (2n=4x=68), and V. asperum is hexaploid (2n=6x=102) (Simón-Porcar et al. 2015), further confirming the role of polyploidy in speciation events. For these reasons, detailed here for a single prominent case, all analyses based on a restricted number of records should be interpreted with caution.
Additionally, the analyses that showed weaker regressions or positive trends were generally for older families, according to the age estimates provided by TTOF (Table S3,  This suggests that for certain ancient clades polyploidization may not be the main driver of speciation, although Rosaceae, the oldest clade (106 Ma), agreed with the hypothesis of higher polyploidy occurrence in recently diverged species. It is noticeable that some of these ancient families (i.e. Nymphaeaceae, Saxifragaceae, Rosaceae) showed high percentages of polyploid taxa, sensu Wood et al. 2009 (higher than 45%, Fig S5, Online Resource 4). In Saxifragaceae, a lack of relationship over time and high number of chromosomes in older taxa could indicate an initial burst of speciation via polyploidy followed by a lesser involvement of polyploidy in speciation. Therefore, polyploidy could be relatively widespread even across ancient families, but it is evidently not the main driver of speciation trends through time for these families.
Why recent polyploids are relegated to limited ranges is not immediately evident from our dataset. As confirmed by a body of research, polyploids are often adaptable species able to survive in harsh environmental contexts, and thus advantaged when colonizing new habitats; this tendency has been observed across the main angiosperm clades (Flovik 1940; Brochmann et al. 2004; Manzaneda et al. 2012; te Beest et al. 2012; Mas de Xaxars et al. 2016; Paule et al. 2018; Stevens et al. 2020). Intriguingly, the alteration of phenotype by polyploidy suggests that plant functioning and fitness may be fundamentally changed, but a preliminary classification of the species in our dataset according to Grime's CSR ecological strategies (method of Pierce et al. 2017) showed that no particular strategy class was associated with polyploidy: species adapted to survive competition, stress or disturbance all exhibited an extremely wide range of chromosome numbers (data not shown). Rather than reflecting fitness and adaptation, the high incidence of endemic polyploids could depend mainly on limited time for dispersal and colonizing new areas, as has been suggested for certain species (Behroozian et al. 2020). Indeed, it has been determined that polyploid plant species rely mainly on vegetative reproduction (Herben et al. 2017), which is usually ineffective for wide or rapid dispersal (Winkler and Fischer 2001; Herben et al. 2016). Moreover, the rearrangements of the chromosome set and the encumbrance created by multiple chromosomes in polyploid cells hinder the meiotic process and thus impose disadvantages for sexual reproduction (Herben et al. 2017). On this premise, it is likely that neo-polyploids tend to remain limited to a narrow range (and are thus apo-endemic) compared with recently diverged diploids.
Increased availability of data regarding species molecular dating, chromosome counts and distributions will provide further support to this analysis. This particularly applies to families with less complete records (e.g. Ericaceae, Magnoliaceae, Nymphaeaceae), in order to determine whether contrasting results for these families are indicative of truly different patterns or are data artifacts. This is of particular concern for ancient clades, highlighting possible changes in prevailing speciation mechanisms throughout evolutionary time.

Conclusion
The hypotheses were supported: chromosome duplication is more prevalent in recent angiosperms, in particular young endemics, confirming the role of polyploidy as a key driver of neo-endemism throughout the flowering plants. This pattern was generally, but not always, evident across flowering plant families, but some cases in which patterns were lacking may reflect insufficient data availability. Vegetative reproduction is a typical feature of polyploid taxa, suggesting that, despite a potentially greater adaptive capability, neo-polyploids tend to remain relegated to small ranges (compared to diploids) due to lower dispersal potential. However, the majority of young species (both endemics and non-endemics) are diploid, and thus polyploidy is not an exclusive driver of neo-endemism. Ancient clades (Orchidaceae, Magnoliaceae, Nymphaeaceae, Saxifragaceae and Ericaceae) exhibit weaker or contrasting trends, suggesting that genome doubling has become established as a key speciation mechanism mainly in more recent clades.

Declarations
Funding: SV was supported by a Ph.D. scholarship from the Department of Agricultural and Environmental Sciences, University of Milan, as part of the 35th round of funding in the Doctoral School of Agriculture, Environment and Bioenergy (Dottorato di Ricerca in Agricoltura, Ambiente e Bioenergia).
Conflicts of interest/Competing interests: The authors declare that they have no conflict of interest.
Availability of data and material: The dataset is available in Microsoft Excel format as Table S1 (Online …)

Figure 1 The relationship between time since taxon divergence and number of chromosomes (as a proxy for ploidy level) across all major clades of the Angiosperms, applied to: a) the entire dataset; b) endemic species; c) non-endemic species. Squares represent endemic species, circles represent non-endemic species. Broken lines represent 'upper boundary regressions', or 3-parameter Lorentzian regressions fitted to the five highest values in each 'bin' or 1 million year interval (bin value data points are filled in dark grey; points under the upper boundary curve are unfilled). Dotted lines represent the ±95% confidence interval.
Figure 2 The relationship between time since taxon divergence and number of chromosomes for examples of angiosperm families exhibiting a declining upper boundary relationship: a-c) Campanulaceae, all spp., endemic spp. and non-endemic spp., respectively; d-f) Compositae; g-i) Leguminosae; j-l) Poaceae. Squares represent endemic species, circles represent non-endemic species. Broken lines represent 'upper boundary regressions', or 3-parameter Lorentzian regressions fitted to the five highest values in each 'bin' or 1 million year interval (bin value data points are filled in dark grey; points under the upper boundary curve are unfilled). Dotted lines = ±95% C.I. Note that x and y data ranges are different for each family.
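The selection of upper boundary points described in the figure captions (the five highest chromosome numbers in each 1 million year bin) can be illustrated with a minimal sketch. The bin edges (half-open intervals starting at 0 Ma), the function names, the toy records and the particular 3-parameter Lorentzian form y = a / (1 + ((x - x0) / b)^2) are all assumptions made for illustration; the captions do not state the exact parametrisation or software used.

```python
import math
from collections import defaultdict


def upper_boundary_points(records, bin_width=1.0, top_n=5):
    """Keep the top_n highest chromosome numbers within each age bin.

    records: iterable of (age_ma, chromosome_number) pairs.
    Bins are assumed to be half-open intervals [k*w, (k+1)*w).
    """
    bins = defaultdict(list)
    for age, count in records:
        bins[math.floor(age / bin_width)].append((age, count))

    selected = []
    for key in sorted(bins):
        # Sort by chromosome number, highest first, and keep top_n.
        top = sorted(bins[key], key=lambda r: r[1], reverse=True)[:top_n]
        selected.extend(top)
    return selected


def lorentzian(x, a, x0, b):
    """One common 3-parameter Lorentzian form (assumed, see lead-in)."""
    return a / (1.0 + ((x - x0) / b) ** 2)


# Hypothetical toy records: (divergence age in Ma, chromosome number).
records = [
    (0.2, 34), (0.5, 68), (0.9, 102), (0.1, 18), (0.3, 20),
    (0.7, 22), (0.8, 16), (1.5, 24), (1.2, 48),
]

selected = upper_boundary_points(records)
print(selected)
```

The selected points would then be passed to a non-linear least-squares routine to fit the curve; that step is omitted here because the fitting software and settings are not given in this section.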
