Toward understanding the genetic mechanisms of speciation: Fine-grained introgressions between species

: The biological species concept (BSC) is the prevailing definition of species whereby genes cannot be exchanged during or after speciation. In contrast, since only genes that contribute to the adaptive divergence between species should be non-exchangeable, BSC appears to bear little relationship to the genetic process of speciation. The rejection of BSC demands evidence of continual gene flow until the completion of speciation. Here, we carry out the sequencing and de novo high-quality assembly of the genomes of two closely related mangrove species ( Rhizophora mucronata and R. stylosa ). Whole-genome re-sequencing of individuals from their distributions on the tropical coasts shows their genomes to be well delineated in allopatry. They became sympatric in northeastern Australia where their genomes harbor 9,963 and 3,874 introgression blocks, respectively, with each block averaging only about 3-4 Kb. These fine-grained introgressions indicate that gene flow continues even after numerous loci have evolved to become non-introgressable “genomic islets”. Many of these islets, only 1.4 Kb on average, harbor “speciation loci” which contribute to the divergence in flower development or gamete production and thus result in fitness reduction upon introgression. Our fine-grained analysis may thus be the first one to show that gene flow not only happens during speciation, but often continues until the completion of speciation. Under the weight of the extensive literature, BSC may deserve to be retired.


Introduction
The biological species concept (BSC) has been the gold standard of modern evolutionary biology [1][2][3][4][5][6][7][8][9][10] . In this concept, species are products of allopatric speciation during which geographical isolation ensures the absence of gene flow and permits continual divergence. Here, we will use the term BSC broadly for both the process of speciation and the concept of species. BSC makes clear assumptions about the genetics of species divergence, postulating that nearly the entire genome evolves as a cohesive unit 1 . BSC is essentially a genomic concept of species 2,5 .
In an alternative genic view, species are defined by a set of loci that govern the morphological, reproductive, behavioral and ecological characters. These "speciation genes" may, collectively, account for no more than a fraction of the genome [11][12][13] . This fraction should be fitness-reducing upon introgression 14 , whereas the rest of the genome can be freely exchanged without a fitness consequence. In short, the diverging genomes comprise both introgressable and non-introgressable DNA segments. These non-introgressable segments are often referred to as "genomic islands" which are, in theory, more divergent than the rest of the genome 2, [15][16][17][18] . Based on this conjecture, genomic islands have been identified in a wide array of speciation events 16,17,26,27,[18][19][20][21][22][23][24][25] . It would seem timely to propose that, because substantial gene flow often accompanies speciation, BSC should be rejected.
In this backdrop, Wang et al. (2020) 5 recently offer a vigorous defense of BSC, which in fact expects gene flow during speciation because geographical isolation (for example, at the Isthmus of Panama 28 ) could only be built up gradually. In other words, BSC only assumes that speciation should end with a period of strict isolation but does not specify how it should begin (see Fig. 1 of ref 5 ). The extensive genomic literature is often about introgressions in the early stages of speciation, which is not incompatible with the generalized BSC. Therefore, if a definitive proof of continual gene flow throughout speciation can be obtained, the extensive literature, collectively, should be able to reject BSC. The subject has been further explored in a spirited debate on the issues raised by Wang et al. 5,29-32 .
The issue of "speciation with gene flow" is important at all levels of evolutionary biology, ranging from the taxonomic practice of delineating species, the mechanisms of creating biodiversity hotspots to the nature of genic interactions in speciation. We further show that, by focusing on the small genomic islands with no gene flow, many speciation genes can be efficiently identified. Fig. 1A shows a hypothetical segment of the genome in an early, middle and late stage of speciation 2,15,16,33 . Importantly, by the late stages of speciation, many phenotypic characters underpinned by numerous loci of adaptive differences are assumed to have evolved. Such ensembles of functionally divergent genes between species and racial groups have been extensively reported 2,11,37,38,[12][13][14]16,[33][34][35][36] . In this genic view of speciation 2,13 , gene flow continues but the introgressions would gradually decrease in size and may become too fine-grained to identify. In this framework, we study two closely related mangrove species -Rhizophora mucronata vs. R. stylosa 7,[39][40][41][42][43][44][45] . Mangroves are woody plants that have colonized the intertidal zones of tropical coasts 7,39,40,46 . Because of the narrow band of suitable habitats along the coasts (or near river mouths), the global distributions of mangroves are essentially one dimensional, making them ideal for biogeographical studies of speciation.

De novo assembly of the genomes of R. mucronata and R. stylosa
We sequenced the genome of one R. mucronata and one R. stylosa individual, with estimated genome size of ~300 Mb and ~370 Mb, respectively (Supplementary Table S1; see Supplementary  Table S2 for details). Although re-sequencing of these two genomes have been done before 39 , the aims of this study demand accurate assembly of their own genomes in order to assess possible fine-grained introgressions. We therefore carry out de-novo sequencing of their genomes and attain the chromosome-level assembly in this study. Re-sequencing of individuals from their geographical ranges of distribution is based on the new assembly.
We obtained 33.84 Gb (gigabases) and 36.13 Gb of raw data, corresponding to 142X coverage of the assembled genomes (Supplementary Table S2 Tables S1 and S2). These results indicate high-quality de novo genomes approaching chromosome-level completeness. A total of 26,540 and 30,375 protein-coding genes in the R. mucronata and R. stylosa genome (Supplementary Table S1) are, respectively, classified into 19,089 and 19,018 gene families.
To see the evolution of the genomic structures of R. mucronata and R. stylosa in the larger phylogenetic context, we compare their genomes to those of Carallia longipes (unpublished data), Bruguiera gymnorrhiza 46 and Rhizophora apiculata 39 ; C. longipes being from one of the closest non-mangrove genera in the Rhizophoraceae family. In total, 20,971 gene families are identified among the five species, with 6,014 are single-copy. Using these single-copy orthologs, we reconstruct the phylogeny and estimate divergence times (Fig. 1C). The phylogeny shows that R. mucronata and R. stylosa are the most closely-related pair diverging at ~3.78 Mya (with the 95% confidence interval at [3.15, 4.35] Mya; Fig. 1C and Supplementary Table S3 and Fig. S3). The inter-specific Ks (~0.0031) and the genomic divergence (Dxy = 0.0031) all support the close relationship between R. mucronata and R. stylosa (Supplementary Table S2 and Fig. S4). We identify 663 collinear blocks between the two species that harbor 18,705 genes in R. mucronata and 18,952 genes in R. stylosa (Fig. 1B).

Population genomic diversity within R. mucronata and R. stylosa
Rhizophora mucronata has a wide distribution in the Indo-Western Pacific (IWP), particularly to the west of the Strait of Malacca and all the way to East Africa. In contrast, R. stylosa extends eastward from the Strait of Malacca to the western Pacific Islands (Fig. 2A). The two species have been reported to overlap in scattered locales along a number of western Pacific coastlines. However, in our own field trips, their relative abundance is often skewed in favor of one species and their co-occurrence has been rarely found. The sole exception in our collection is in the Daintree River (DR) area of northeastern Australia, where both species are quite abundant ( Fig. 2A). It is the evidence from this site of sympatry that is instructive about the genic makeup of species.
For population genomic studies, 21 R. stylosa individuals from four locations (labeled s1-s4) and 31 R. mucronata individuals from seven locations (named m1-m7) were analyzed ( Fig. 2A Table S5 and Table S6). The short reads of each individual were mapped to the de novo R. mucronata genome, with average genomic coverage of 81% (80%-83%) (Supplementary Table S4). The level of genetic diversity shows two patterns. Low genetic diversity is found in all allopatric populations (average θπ at 0.62 and 0.60 per Kb for R. mucronata m2-m7 and R. stylosa s2-s4, respectively (see also Supplementary Tables S2 and S4). The level is much higher in the sympatric DR populations (θπ = 1.37/Kb and 2.09/Kb, respectively). The Watterson's estimates (θw) are similar (Supplementary Table S4, see Materials and Methods).

Divergence between the two species in allopatry
The genomic divergence between the two species is 4.14 x 10 -3 (Supplementary Table S7). We first constructed a Maximum Likelihood (ML) tree using RAXML 47 on the sequences of the 31 R. mucronata and 21 R. stylosa individuals from the 11 populations. The ML tree bifurcates with a clear delineation between species across all allopatric populations. However, the m1 and s1 (i.e., DR) samples show strong signs of admixture as they are "in the middle" of the bifurcated tree ( Fig. 2A). When the DR samples are removed, the phylogeny shows clear delineation (Fig. 2B). These two trees are robust when rebuilt using the ML method in IQTREE 48 or the Neighbour-Joining (NJ) method in MEGA7 49 (Supplementary Fig. S5 and S6). The monophyletic delineation of R. mucronata and R. stylosa in allopatry is also supported by the principal component analysis (PCA) 50 (Supplementary Fig.  S7).
In total, 1.7 million variable sites are detected across all populations of the two species (Supplementary Table S8). We first partition these sites by excluding the DR samples (see Materials and Methods). Each site is then represented by an FST value with FST = 0 indicating no differentiation between the two species in allopatry and FST = 1 indicating complete differentiation. Figure 2C shows the U-shaped distribution whereby the abundance of sites at the far right reveals the extensive differentiation between species. Such a U-shape distribution is typical of species of some divergence with little gene flow 51 .
Morphologically, the two species are distinguished by style length 40,41 , as pictured in Fig. 3A. The morphological differences between R. mucronata and R. stylosa across populations are shown in Fig.  3C. R. mucronata is readily distinguished by its short style, in the range of 0.9-1.6mm (Fig. 3A). In contrast, the style of R. stylosa is long, 2.4-5.3 mm (Fig. 3A) with no overlap between the two species ( Fig. 3C) 40,41 . While the style length varies from locale to locale in both species, this trait is a species-diagnostic one across locales. The two species also show different habitat preferences with R. mucronata usually found upriver while R. stylosa is close to the river mouth (Fig. 3D). Additional diagnostic morphological characters, which are less stable, are listed in the Supplement ( Table S9).

Characterizations of R. mucronata and R. stylosa in sympatry (the DR samples)
Morphological characters of the sympatric DR samples, which appear admixed in their DNA sequences, remain distinct. The style length of each sample is concordant with that of the allopatric populations of the same species (Fig. 3B). In fact, the two species in all sympatric populations can be clearly delineated by this character (see Fig. 3C). In the DR area, these two species are parapatric-sympatric with distributions up-or down-river and extensive overlaps in the middle (Fig.  3D). This difference in habitat preference is seen everywhere 41,42 .
Corroborating the phylogenetic positions of the DR samples in Fig. 2A, we use the Bayesian clustering analysis, ADMIXTURE 52 . The analysis identifies two genetic components that make up the genomes of the DR samples (Fig. 4A). PCA results also indicate significant admixture in m1 and s1 individuals ( Supplementary Fig. S7). Furthermore, because species divergence is monophyletic in all allopatric comparisons, incomplete lineage sorting as the cause of the observed admixture in the DR samples is rejected. In short, we interpret the high FST sites as manifesting the divergence after speciation ( Fig. 2C) with subsequent admixture in the DR area. Additional tests of introgression (LD analysis, D statistic and the modified fd statistic) are presented in the Supplement (Tables S8, S10 and  Figs. S8-S9).

Extensive introgressions in sympatry
For the two species in sympatry in the DR area, we ask the following questions: 1) How many introgressed segments can be found in each species? 2) Is the introgression symmetric between the two species? 3) How fine-grained are the introgressed segments (i.e., many small segments or a few large ones)? A few large blocks are expected after recent hybridizations but many fine-grained blocks may result from old introgressions that have been eroded by recombination. If true, introgression (and hence speciation itself) might be a prolonged process. 4) How many genomic segments are non-introgressable and what are their genic contents? Question 4 will be the subject of the next section.
To quantify the introgressions in the DR area, we first define the divergent sites (or d-sites) between R. stylosa and R. mucronata in allopatry. Among the d-sites, we can then define the introgression sites (i-sites) between the m1 and s1 samples in sympatry. There are 305,418 d-sites, defined as sites with FST > 0.8 between the two species in allopatry. Note that the bulk of d-sites (212,626 sites) are fully divergent with FST = 1.0 (Fig. 2C). An i-site is where the introgressed allele (or i-allele) is found in >= n of the 10 genomes in the m1 or s1 samples, where n is usually equal to 8. (Note that both m1 and s1 samples have five diploid individuals, or 10 genomes.) In general, if introgression is observed in one direction, say from R. stylosa to R. mucronata in the DR area, the same site usually does not show introgression (n <= 1) in the reciprocal direction from R. mucronata to R. stylosa (see Materials and Methods).
We now define n (the number of genomes carrying the introgressed allele), which is necessary for determining the i-sites. It is obviously better to set n close to the maximum of 10 for strongly penetrant introgressions. Fig. 4B shows the level of introgressions in the two directions. We set n = 8 for the m1 samples where the i-allele is usually found >= 8 times (the orange bars in Fig. 4B). Hence, the results with n= 2 and n = 8 would not be very different. Furthermore, to avoid the confounding presence of remnant ancient polymorphisms, we require introgressions at an i-site to be strongly asymmetric: >= n one way (say, from R. stylosa to R. mucronata) and <= 1 in the reciprocal direction ( Supplementary  Fig. S10). For the R. stylosa (s1) samples, the occurrence of the i-allele is rather even between 2 and 10 (the green bars in Fig. 4B). The asymmetry is probably due to the geography of the DR area, which is at the fringe of the R. mucronata distribution. Consequently, gene flow from R. mucronata into R. stylosa may be more limited here, resulting in the lower frequency of introgressions in the s1 samples. In this regard, setting n = 8 would miss many introgressions in R. stylosa leading to a much lower introgression rate than in R. mucronata. Nevertheless, the final estimations appear robust even when n is set as low as 2 (see below). Simulations of these scenarios are presented in the Supplement.
Clearly, introgressions do not happen site-by-site, but appear as long segments of DNA consisting of consecutive i-sites. We shall label these segments "introgression blocks" (or i-blocks) ( Supplementary Fig. S11). Fig. 5A shows a segment of the genome that comprises a string of d-sites and i-sites as defined above. These d-sites and i-sites are embedded in a background of low-FST or invariant sites shown as dots in Fig. 5A. This figure shows 3 i-blocks, each consisting of one, two or three i-sites. The length of each block is defined by the distance between the two breakpoints flanking the block. Unless specified, we remove the singleton i-blocks that harbor only a single i-site when presenting the length distribution of i-blocks.
The analysis of i-blocks is summarized in Table 1 (see also Supplementary Tables S11-S13). We shall focus on the results with n = 8 but the results with n = 2 and n = 10 are given for comparison. In the DR area, samples of R. mucronata (m1) harbor far more introgressions than those of R. stylosa (s1). The bottom of Table 1 at n=8 shows that 16.09 or 23.09% of the R. mucronata genomes are introgressions from R. stylosa, the two values depending on whether singleton i-blocks are counted. In the opposite direction, 7.97-12.06% of the R. stylosa genomes are introgressions. The introgressions of Table 1 can be visualized in Figs. 5-6. The salient observation is the highly fine-grained nature of the introgressions. In R. mucronata, the introgressions are distributed over 9,963 i-blocks with an average length of 3. 24 Kb. In R. stylosa, there are 3,874 i-blocks with an average size of 4. 13 Kb. During the evolution, there should be numerous recombination events that break the introgressions into thousands of tiny i-blocks. It should be noted that Table 1 and Figs. 5-6 present only the extreme cases of introgressions that rise to very high frequencies.
The distributions of i-blocks are shown at the large genomic scale in Fig. 5B, at the scaffold scale in Fig. 5C and as individual sites in Fig. 6A-6C. Note that only d-sites and i-sites are portrayed in these figures, which convey the visual impression of the fine-grained nature of the i-blocks. (As shown in Fig. 2C, the d-and i-sites are the 305,418 sites with FST > 0.8; the rest are invariant and lowly divergent sites.) Specifically, the i-blocks are dispersed across the whole genome ( Fig. 5B and Supplementary  Fig. S12). Indeed, all top 18 scaffolds harbor the switching between i-and d-blocks both in m1 and s1 genomes, massively and frequently ( Fig. 5B and Table 1). Figure 5C shows that the switching between i-and d-blocks can occur in a few to tens of Kbs. At the site level, i-blocks and d-blocks can switch within a small distance ( Fig. 6A-6C). An i-block (or d-block) may harbor only one i-site (or d-site), referred to as singleton block ( Fig. 6A-6C, Table 1 and Supplementary Table S11). Singleton blocks, not uncommon but less reliable, are not used in the tally.
The extensive fine-grained introgressions convey two messages. First, hybridizations may happen continually over a long span of time. Each hybridization event would initially bring in whole-chromosome introgressions that are subsequently broken down by recombination. Small DNA fragments may have been introgressed in this piece-meal manner continually. Second, loci underlying differential adaptation between species may be very common such that introgressions tend to be small, and thus free of the introgressed alleles that are deleterious in the genetic background of another species 14 . In the next section, we will direct the attention toward non-introgressions, which are blocks of native alleles flanked by introgressed DNA segments.

Very fine-grained interspersion between "introgressable" and "non-introgressable" blocks
Some DNA segments may not be introgressable due to the presence of genes of adaptive significance. Such loci, by definition, contribute to reproductive isolation or ecological speciation 2,9 and have sometimes been referred to "speciation genes" 8,10,12,13,34,53,54 . The number, size and direction of introgressions are therefore functions of a number of parameters: 1) the rate of hybridization; 2) the strength of selection against the speciation genes when introgressed; 3) the number and location of speciation loci; 4) the rate of recombination that free neutral genes from the linkage to speciation genes; and 5) the length of time since the time of initial hybridization.
In a companion study 5 , we carry out computer simulations based on the Recurrent Selection and Backcross (RSB) model 55 (see also Materials and Methods). The RSB model is proposed for identifying genes of complex traits 55 . In its execution, one dilutes the genome of breed A (say, the bull dog) with that of breed B (e.g., the border collie) but retains all the desired phenotypic traits of the former. This is done by continually selecting for the traits of breed A while backcrossing the culled products to breed B. The scheme is almost identical with the process of "speciation with gene flow" in its model structure. They differ only in the parameter values; for example, the length of time in speciation is far greater and gene flow is much smaller, and often bidirectional. The differences necessitate separate simulations for speciation with gene flow. As shown in Fig. 2 of Ref. 5  With the j-site defined above, a j-block is defined as a DNA segment containing at least one j-site ( Table 2). By these stringent criteria, there are ~1200 j-blocks which together account for < 1% of the genome and harbor 328 coding genes of which 171 genes contain j-sites ( Table 2, Supplementary  Table S14 and Fig. S14). For higher confidence, we also show j-blocks with at least two j-sites ( Table  2). While only 19 genes containing j-sites are found in these j-blocks (Supplementary Table S15), it is remarkable that 6 of the 19 genes function in flower development and/or gamete production and development as shown in Table 2 (see the WEGO gene ontology in Supplementary Fig. S14, where a larger set of genes is presented under less stringent criteria). One (RM_77078.7) of the six genes, known as EMF1 (embryonic flower 1), regulates reproductive development and is involved in controlling flowering development 56,57 . RM_76773.10, RM_76979.9 (NAC2) and RM_77530.24 are involved in regulating the stamen development, pollen germination and tube growth [58][59][60] . RM_76929.10, RM_76979 and RM_77333.68 all play a role in embryonic development [61][62][63] . Since all six genes contain highly differentiated amino acids and non-introgressable sites (j-sites) ( Table 2,  Supplementary Table S15 and Fig. S15), their involvement in the speciation between R. mucronata and R. stylosa seems plausible.

Discussion
The species of R. mucronata and R. stylosa in the DR area are unusual among sympatric species reported in the literature 40,44,45 as explained below. These features may be the reason that they are ideal for a rigorous test of BSC by looking for fine-scale introgressions in sympatry, vis-à -vis the populations in allopatry ( Fig. 2A and Fig. 5B). Most important, the fine-scale analysis permits the identification of many genomic islets (~1.4 kb on average) that harbor speciation loci. Genic divergence at these loci contributes to the divergence in flower development or gamete production.
The DR populations stand out even among populations from other sympatric locations between the two species. For example, the m2/s2 collections from Singapore both show the expected phylogenetic relationship of their species designation ( Fig. 2A and Fig. 2B). This pattern of little admixture is consistently found in other locales: in Brandan, Indonesia 44 , in Panay Island, Philippines, in Kosrae, Micronesia, in Yap, Micronesia and in North Sulawesi, Indonesia 45 . The two species in sympatry outside of the DR area occasionally show a slight tendency of being "on the fringe" of their phylogenetic cluster, thus suggesting low-level introgressions. Nevertheless, the extensive fine-grained introgression observed in the DR samples has not been reported before. Importantly, Yan et al. 45 did notice samples from northeastern Australia (from Trinity Inlet and Daintree River, Australia) to be different without further clarification.
The near absence of prior reports of fine-grained introgression is understandable as several conditions have to be met for this phenomenon to be realized. The first condition is that the two diverging populations have to be in the right stage when they first come into contact. This stage roughly corresponds to Stage III defined in Wu (2001) 2 whereby speciation is nearly complete but gene flow is still possible. Had the secondary contact happened before this stage (in Stage II, for example), the process of speciation could be arrested or reversed. On the other hand, if the contact starts too late, there would be too little gene flow to have the extensive introgressions observed in the DR area.
Among the sympatric locations reported for R. mucronata and R. stylosa, northern Australia has been suggested as the place where the two species first came into contact in their incipient stage of speciation (see the Supplement) 40 . In this interpretation, the two diverging taxa moved eastward either along the northern coasts of Indian Ocean to Southeast Asia or by crossing the Indian Ocean to Australia 40 . R. mucronata in Southeast Asia then dispersed south and eastward to Australia, while R. stylosa in Australia migrated further east and north into SE Asia (see the Supplement) 40 . By the time the two diverging species were reunited in sympatry outside of northern Australia, the two species would have developed some reproductive isolation which hindered exchanging of genes, thus explaining the unmixed phylogenetic relationship in these other locales.
The second condition may be even more difficult to satisfythat the two species need to remain in secondary contact for a long period of time 4 . As discussed, numerous recombination events accumulated over a long period of time are necessary to achieve the fine-grained introgression. Non-trivial gene flow during secondary contact may also prevent further build-ups of functional divergence that would lead to the complete cessation of gene flow 2 .
The third condition is ecological. Two sympatric species without niche separation would face the problem of competitive exclusion 64 , making long-term coexistence unlikely. R. mucronata and R. stylosa had evolved a degree of niche separation that results in incomplete overlaps in habit preference (Fig. 3D). Given the necessary confluence of all these conditions, R. mucronata and R. stylosa in the DR area may be truly exceptional. Instead of a few large "genomic islands" 19-21,24-26 , we observe in the DR samples a large number of tiny islets. A precedent is the genomic island surrounding the speciation gene, Odysseus, which is < 2 Kb in size 16 and, hence, an islet. Small islets are obviously conducive for the identification of "speciation genes" as only a few candidate genes are involved (see Table 2 and Supplementary Table S15).
These conditions make this study particularly suited to the inquiry of speciation with continual gene flow. It raises the question of whether the extensive reports on speciation with gene flow 15 Finally, BSC's requirement of complete cessation of gene flow, imposed by strict geographical isolation, has posed many dilemmas. There are simply too few stable geographical features like the Isthmus of Panama to facilitate strict allopatric speciation 6,7 . Furthermore, such a geographical feature does not help to explain the teeming biodiversity on the Indo-western Pacific coasts 7,71 , in the Amazons 6 or in Lake Victoria 72 . In this sense, BSC is conceptually flawed as pointed out repeatedly 2 . Furthermore, genomic data that have sufficient resolution to provide a meaningful test have consistently failed to support BSC 7,13,14,[73][74][75][76][77] . Had BSC been an imperfect guide in the actual practice of species delineation, it may still be useful, but it is not. It may therefore be time to seriously consider abandoning BSC in the curricula of evolutionary biology.

Plant material, genome sequencing and assembly
The sequenced R. mucronata and R. stylosa individuals were collected from Dongzhai Harbor, Hainan, China (110°35'5.79'' E, 19°56'39.67'' N) (Supplementary Table S2), although the R. mucronata individual was originally introduced from Australia 78 . Genomic DNA extraction from leaves was done following the CTAB method 79 . Total RNA was extracted from leaves, flowers and fruits using the modified CTAB method 80 . Short-read libraries were constructed and sequenced using the BGISEQ-500 platform. 50 Kb long-read libraries were prepared using the 10X Genomics (Illumina Hiseq X Ten) platform. Before assembling, low-quality reads, adaptor sequence, N and polyA contamination were filtered out. The 10X Genomics data was used to assemble a draft genome using SUPERNOVA 81 . To calibrate and refine the assembly, we used clean short reads and Hi-C reads based on the 3D-DNA (3D DNA de novo genome assembly) pipeline (https://github.com/theaidenlab/3d-dna). After manual check and calibration with Juicabox (https://github.com/aidenlab/Juicebox), we anchored 18 pseudo-chromosomes (chr1-18) both for R. mucronata (200.70 Mb, or 84.38%) and R. stylosa (219.66 Mb, or 86.63%) genomes, corresponding to their diploid chromosome number of 2n = 36 (Supplementary Table S1) 82,83 . The transcriptome data was used to perform protein-coding gene annotation (26,540 genes in R. mucronata and 30,375 genes in R. stylosa; Supplementary Table S1). Genome completeness (94.2% for R. mucronta and 94.3% for R. stylosa) was assessed using the lineage database eudicotyledons_odb10 and the BUSCO software (https://github.com/c-omics/busco) (Supplementary Table S1). The raw reads were mapped to the de novo genome using the Burrows-Wheeler Aligner (BWA) 84,85 . The mapping rates were as high as 94.52% for R. mucronta and 92.74% for R. stylosa (Supplementary Table S2).

Collinearity analysis
The collinearity analysis were done within and between R. mucronata and R. stylosa genomes. Alignment was performed on protein sequences of R. mucronata and R. stylosa using BLASTP (with a cutoff e-value of 10 −10 , identity ≥ 40%). We then used MCScanX 86 to identify syntenic or collinear blocks, with blocks having at least five paired homologous genes accepted. Genomic distribution of collinear blocks was visualized using the Circos (v0.65) software ( Fig. 1B and Supplementary Figs. S1-S2) 87 . 267 and 305 syntenic blocks were identified within the R. mucronata and R. stylosa genome, occupying up to 32.01% (8,496 of all 26,540 genes) and 29.61% (8,995 of all 30,375 genes) genes of each genome (Supplementary Figs. S1-S2). Through collinearity analysis between R. mucronata and R. stylosa, we found 663 inter-specific collinear blocks with 18,705 R. mucronata genes (70.48% of all 26,540 genes) and 18,952 R. stylosa genes (62.39% of all 30,375 genes) involved (Fig. 1B).

Gene family analysis and divergence time estimation
To estimate divergence times, we used the de novo genomes of Carallia longipes (unpublished data), Bruguiera gymnorrhiza 46 , Rhizophora apiculata 39 , R. mucronata and R. stylosa. We used OrthoFinder-2.2.7 88 to identify gene families 85,89 . All proteins from these five species were merged to perform an all-to-all alignment using DIAMOND (with a cutoff e-value of 10 −30 ) 90 . 20,971 gene families are identified among the five species. 6,014 of them contain only one gene in each species (i.e., single-copy orthologs). Genes from the five species that fall into shared single-copy orthologs were aligned using MUSCLE 91 and codon sequences were obtained using PAL2NAL.v14 92 . We used JMODELTEST2 93 to select an appropriate nucleotide-substitution model for reconstructing the phylogeny. To reconstruct the phylogenetic tree, we used RAXML 47 and IQTREE 48 with the best-fit model (GTR+G) and 1000 bootstrap replicates (Supplementary Fig. S3). Finally, the program MCMCTREE from the PAML4.8 package 94 was employed to estimate divergence times (Fig. 1C, Supplementary Table S3 and Fig. S3), using "seq like (usedata = 1)", "JC69 (model = 0, alpha = 0)" and "independent rates (clock = 2)". The time constraints were based on the earliest known fossil records of mangrove lineages (hypocotyl fossils of Bruguiera in early Eocene (constraint 1: 47.8-56 Myr ago) and the oldest records of Rhizophora in the upper Eocene (constraint 2:33.9-38.0 Myr ago) 39,95-97 .

Sampling and genome re-sequencing
To make the samples of R. mucronata and R. stylosa more representative, we collected individuals both in allopatry and sympatry in the Indo-West Pacific region ( Fig. 2 and Supplementary Tables S4). We re-sequenced 31 R. mucronata individuals from seven populations and 21 R. stylosa individuals from four populations ( Fig. 2 and Supplementary Tables S4-S6). To tell apart the two species by morphology, we observed the style length and shape in the bud and took photos ( Fig. 3 and Supplementary Tables S9). Fresh leaves were sampled from individual trees and dried with silica gel. Genomic DNA extraction was done following the CTAB method 79 . Short-read libraries were sequenced using the Illumina Hiseq 2000 platform with insert size of 350bp and constructed following the TruSeq DNA Sample Preparation Guide. We obtained high quality sequence data for each individual genome, with coverage in the 12 to 22X range (Supplementary Tables S5-S6).

SNP calling and genetic diversity detection
To identify variants, clean reads from all 52 individuals were mapped to the de novo R. mucronata genome using the Burrows-Wheeler Aligner (BWA) 84 . SAMtools 98 were used to import, sort, and pair bam files and remove duplications. To obtain high quality variants, single nucleotide polymorphisms (SNPs) were called and filtered using the Genome Analysis Toolkit (GATK) 99 and the SAMtools/bcftools 98 pipeline. Only consensus SNPs called by both pipelines were retained for downstream analysis. To remove low-quality variants, we eliminated all loci that had base quality (Q) or mapping quality (q) smaller than 20. We additionally performed the following stringent filtering: 1) of an i-block is determined by the midpoint between the flanking (d-sites, i-sites) intervals (as shown in Fig. 5A). j-block: a genomic block with one or more j-sites continuously. We define the boundaries the same as for i-blocks.

Simulations of genomic sequences under hybridization, selection and recombination
To probe the influences of hybridization, selection and recombination on genomic sequences, we carry out computer simulations based on the Recurrent Selection and Backcross (RSB) model 55 . We set high and low levels for each parameter. Population size was set at 1000. The length of simulated sequences is 100 Kb (for convenience, 1 Kb is the basic unit that cannot be separated by recombination). The original allele in the sequence and an i-allele from the other species are differentially labeled. Hence, at the beginning of the simulations, the sequences of all individuals are in original alleles states (100 x). After several generations of hybridization, selection and recombination, the sequences become shuffled (Fig. 6D-E and Fig. S13).
We first set a low hybridization rate (introgression or migration rate, m = 0.001 per generation) and recombination (10E-6 per generation between adjacent base pairs). For every generation, 999 individuals are picked from the original population and one is from the other population (or species). The recombination probability (r) for a 100 Kb sequence is about 0.1. Since population size is 1000, there will be an average of 100 individuals with recombination in each generation. Two loci under negative selection (#51 and #71) are defined in the simulated sequences. If one or both loci harbor an i-allele, the relative fitness of this sequence is 0.95 (Fig. S13A) or 0.99 (Fig. S13B). We also examined a high introgression rate regime (10/1000). In this case, four loci (#41, #51, #71 and #76) were negatively selected (relative fitness = 0.95 for an i-allele) (Fig. S13C). The scenarios in Fig.  S13A show that a lower recombination rate (r = 0.1) increases the size of non-introgressed DNA segments, because the neutral genes near positions 51 and 71 were selected against along with the speciation loci. Figure S13B-C shows that a reduced selection intensity (s = -0.01) or a 10-fold higher introgression rate give rise to extensive introgressions. Interestingly, partial introgressions were detected even at positions 51 and 71, where selection acts against the invading alleles.
Finally, we simulated genomic sequences under a high recombination rate (10E-5, r = 1.0 for a 100 Kb simulated sequence per generation) and a low introgression rate (1/1000 per generation). Two loci (#51 and #71) were negatively selected (relative fitness = 0.95 for an i-allele) (Fig. 6D-E and Supplementary Fig. S14D-F). The simulations suggest that, given the right parameter values (Fig.  6D-E and Supplementary Fig. S14D-F), the pattern of introgression would follow exactly the prediction based solely on selection, whereby only the alleles of the speciation loci cannot be introgressed. The rest of the genome, even right next to the speciation loci, is freely shared between species.