Assessing sh diversity of Merbok Estuary and adjacent marine waters by DNA taxonomy: towards building comprehensive DNA barcode reference library

The Merbok Estuary comprises one of the largest remaining mangrove forests in Peninsular Malaysia. Its value is signicant as it provides both direct and indirect services to local and global communities. It also offers a unique opportunity to study the structure and functioning of mangrove ecosystems. However, its biodiversity is still partially described, limiting its research value. Recent inventories based on morphological examination, documented 138 sh species residing, frequenting or subject to entering the Merbok Estuary. Using a molecular approach, we assessed the sh diversity of the Merbok Estuary and its adjacent waters in DNA barcoding 350 specimens assignable to 135 species initially identied based on morphology. Our results revealed the presence of 140 MOTUs, 130 of them are congruent with morphology-based species delimitation. In two cases, barcodes did not permit to differentiate between two morphotypes while they unveiled cryptic diversity within six other species, calling for further taxonomic investigations. This study provides a comprehensive core-list of sh taxa in Merbok Estuary, demonstrating the advantages of combining morphological and molecular evidence to describe diverse but still poorly studied tropical sh communities. It also delivers a large DNA reference collection for brackish shes occurring in this region which will facilitate further biodiversity-oriented research studies and management activities. Our results show that DNA barcoding (using COI gene) and morphology-based approach accurately converge on the delimitation of 131 species (about 97% of the examined species) in Merbok Estuary region. DNA barcoding approach further revealed possible cryptic diversity within six species whereas it did not detect any difference between two pairs of species, calling for taxonomic revisionary studies.


Introduction
Estuaries and coastal wetlands which feature mangrove ecosystem are transition zones that link terrestrial and freshwater habitats with the sea 1 . Mangrove ecosystem delivers essential ecosystem services, including shoreline protection, nutrient production and sheries resources. In consequence, mangrove ecosystem plays a vital role in supporting local communities' socio-economic pursuits 2 . Unfortunately, such crucial human-nature relationship is threatened by habitat pollution, destruction, and over shing 3 . It is also impacted by other factors such as species invasion, and climate change 1 .
The less disturbed tropical estuaries, especially their mangrove area, generally harbour rich, diversi ed and complex faunal communities 4 , combining the presence of salinity-tolerant resident species along with regularly or occasionally frequenters. Frequenters include mainly marine species that use this ecosystem either to feed, shelter, breed, or nurse their young 5 . Describing and monitoring the biodiversity in these ecosystems is primordial for long term sustainability because it is considered that biodiversity ensures stability and resistance towards any disturbance or potential invasion through complex speciesspecies interactions 1 . However, biodiversity is still poorly documented in many mangrove ecosystems that hampers further research on their functioning and management.
Malaysia is recognised as one of the world's biodiversity hotspots with astounding levels of diversity and endemism 6,7 . Considering only shes, 8 reported the presence of a total of 1418 brackish and marine species in Malaysian waters, occupying various coastal habitats, including mangroves and estuaries. The remaining coastal mangroves in Malaysia (Peninsular Malaysia and Malaysian Borneo) occupy less than often occupy estuarine banks in saline tidal areas. One of the largest remaining intact patches of mangrove forests is located within the Merbok Estuary (north-west of Peninsular Malaysia, Figure 1). The estuary was gazetted as a permanent forest reserve, the Sungai Merbok Mangrove Forest Reserve in 1951, and is the second largest mangrove forest in Peninsular Malaysia after the Larut Matang Forest Reserve. The Merbok Estuary and its surroundings are a dynamic and productive ecosystem that supports the world's highest mangrove species diversity per unit area within a contiguous habitat, with 39 of the estimated 70 true mangroves species described globally 10 . This area also represents important resource grounds for local populations 11,12 . Due to its biological, ecological, and socio-economic importance, the Merbok Estuary has been the focus of research in the 21 st century, including some biodiversity inventories (trees and gastropods 13 ; shrimps 11 ; shes 12,14 ; mangrove trees 10 ). The latest ichthyological survey has inventoried up to 138 sh species (from 47 families) in the estuary, demonstrating a rich sh fauna 15 . However, because of taxonomical uncertainties using only morphological characteristics, the identi cations of some species are challenging, especially for some speciose families such as Mugilidae, Gobiidae or Eleotridae 15 .
Furthermore, cryptic diversity is frequently encountered in tropical highly biodiverse regions 16,17 , and it is likely that some morphology-based species hide more than one species. In Merbok as elsewhere, a precise account of species diversity is a necessary requirement for further researches and numerous studies have highlighted the complementarity between morphological and molecular approaches to reveal biodiversity [18][19][20] . To date, there is no attempt to compare morphology-based results on sh diversity with molecule-based approach in this mangrove species-rich Merbok Estuary.
Since its introduction in the past decades, DNA barcoding has emerged as the global molecular taxonomic method across shes based on a de ned molecular marker, a ~650 base pairs long fragment of the mitochondrial cytochrome oxidase I gene (COI) 21 . Several regional DNA barcoding studies have demonstrated its e cacy to delimitate sh species, for instance, in Australia 22 , South China Sea 23 , Indian Ocean 24 , and Indo-Paci c coral shes 25 . DNA barcoding has proven to be a reliable method in identifying cryptic and potentially new sh species [26][27][28][29] , marine larval shes [30][31][32][33] , and food origins 34 .
In this study, we assemble a large reference library of DNA barcodes of 350 sh individuals from Merbok Estuary and its adjacent waters for the purpose to describe the sh diversity in this region in providing a critical look at previous morphology-based results. Such species inventory approach has found wide applicability in a plethora of research that hinges on reliable species identi cation, from ecosystem health management, biodiversity conservation, and aquaculture to shery management 35,36 . All of these uses pertain to the Merbok Estuary.

Fish Diversity
A total of 350 specimens (out of 441 collected) were successfully sequenced for the COI gene, representing 135 morphological species, 94 genera, 47 families, 17 orders, and two classes, Chondrichthyes and Actinopterygii (species list shown Table 1). Two of these species (i.e. Cryptocentrus sp. and Johnius sp.) were only identi ed to the generic level using morphology whereas for three other species, Dichotomyctere cf. uviatilis, Brachygobius cf. kabiliensis, and Lagocephalus aff. Lunaris, we used open nomenclature.

DNA-based identi cation
Sequence length for all 350 generated barcodes was longer than 600 bp with no indels or stop codon detected. The nucleotide composition showed a mean percentage of 18.32% (G), 27.97% (C), 24.07% (A), and 30.7% (T). More than half of the species (56%, 76 species) were represented by multiple specimens while 59 species were represented by a single specimen (Table 1). Mean number of specimens per species was 2.59. Increment in the K2P genetic divergence was directly related to the hierarchical taxonomic relationship: within species mean divergence = 0.85% (SE=0.01), within congeners mean divergence = 16.7%, (SE=0.01) and within families mean divergence = 18.17% (SE=0) ( Table 2).
Both Bayesian Inference (BI) ( Figure 4) and Maximum Likelihood (ML) ( Figure S1) trees exhibited wellde ned clusters at the level of orders and families, with minimal differences in topologies. The BI tree was better resolved and used to represent the species delimitation ( Figure 4). The Molecular Operational Taxonomic Units (MOTU) delimitation analyses yielded variable, but relatively high number of MOTUs for each method. The RESL analysis revealed 139 MOTUs assigned to dedicated BINs. Similarly, the ABGD analysis also identi ed 139 MOTUs (P=0.0010-0.0599) within the initial partition for all substitution models (Table S2). The single-threshold GMYC analysis recognised 140 MOTUs that were taxonomically concordant with the identi cation by the other two analyses except for one species, Hyporhamphus quoyi, that partitioned into two MOTUs. Several incongruences between MOTUs and morphology-based species delimitation are highlighted in Figure 4 (red bars) involving multiple species, as presented in Table 3.

Species delimitation methods
One of the premises of DNA barcoding is the detection of the so-called "barcode gap", which can be observed when the intraspeci c divergence is smaller than the interspeci c mean distances 39 . Absence of gap between two morphological species is indicative that these species represent only different forms within one species whereas the presence of a gap within a morphological species is evidence for specieslevel cryptic diversity 40 .
Employing multiple methods and schemes in clustering (using barcode gaps) the generated DNA barcodes provide an e cient approach in identifying putative species. Even though these methods may have individual pitfalls, especially in analysing singletons, they can yield a robust outcome when combined 41 . Our study utilised three different methods in clustering the sequences into MOTUs, namely Re ned Single Linkage (RESL), Automatic Barcode Gap Discovery (ABGD), and Generalized Mixed Yule Coalescent (GMYC). Despites different analytical assumptions supporting each method, all three methods yielded similar results: RESL and ABGD analyses identi ed each 139 MOTUs whereas the GMYC analysis identi ed 140 MOTUs within our dataset. These results demonstrate a robust pattern of MOTUs in our dataset; even the GMYC method which is known to overestimate MOTUs counts compared to other methods 42 , delimitated only one additional MOTU. Because both RESL and ABGD analyses had closer correspondence to the number of species de ned by morphological identi cation, we based our discussion on species account on these two methods.
Our results show that DNA barcoding (using COI gene) and morphology-based approach accurately converge on the delimitation of 131 species (about 97% of the examined species) in Merbok Estuary region. DNA barcoding approach further revealed possible cryptic diversity within six species whereas it did not detect any difference between two pairs of species, calling for taxonomic revisionary studies.
The mean conspeci c K2P divergence (0.85%) was 20-fold lower than the mean congeneric divergence (16.7%). This increase in genetic divergence with increment in taxonomic levels is logical and expected 35 . However, both mean genetic estimates are higher than those previously recorded in other regions. Most molecular assessment of marine shes displayed conspeci c divergence within the range of 0.25-0.39% whereas congeneric divergence were within the range of 4.56-9.93% [22][23][24]36,43 , but 25 found similar pattern of high average conspeci c and congeneric divergence within the Indo-Paci c coral reef shes (1.06% and 15.34%, respectively).

Taxonomic conundrum
We observed high intraspeci c COI variability in six species, namely Eleutheronema tetradactylum (16.66%), Osteomugil perusii (14.24%), Planiliza subviridis (13.44%), Deveximentum indicium (9.05%), Gerres oyena (4.29%) and Lutjanus russellii (4.12%). Such high intra-speci c genetic divergence suggests either misidenti cation or the presence of more than one species 25,44 . We dismiss the rst possibility because the morphological examination of incriminated specimens, based on existing keys, seems consistent. Therefore, such morphology-molecules incongruence may more likely be the signal of hidden diversity. Large genetic differentiation has been reported in E. tetradactylum among allopatric populations within the Indian Ocean 45 . Our results are consistent with this study, further indicating that differentiation in this species is not only allopatrically but, also, sympatrically distributed. Recent molecular taxonomic studies on the family Mugilidae in which are included O. perusii and P. subviridis, evidenced a very high level of cryptic diversity in the Indo-West Paci c region 46,47 . Several valid mullet species are actually complexes of several morphologically similar species for which extensive taxonomic revisions are needed. The taxonomy of D. indicium (family Leiognathidae) is still in ux with continual descriptions of new species 48 . Our results indicate the presence of two sympatric species under D. indicium in Merbok Estuary. Gerres oyena and L. russellii exhibit intraspeci c differentiation of lower magnitude than those observed for the rst four species discussed above, although still well above the threshold of 2%. Lutjanus russellii natively occurs in this region 49 but it is also farmed in Merbok estuary. Aquaculture activities regularly import non-native seeds from various sources, with no or poor records of origins, making it di cult to determine their origins. The divergence observed within this species (4.12%) could be the consequence of the presence of both native and alien (escaped from aquaculture farms) individuals in Merbok estuary 15 .
Two cases of shared MOTUs between species were detected involving the pairs Alepes melanoptera and Caranx sexfasciatus (BOLD:AAB5775), and Dichotomyctere nigroviridis and Dichotomyctere cf. uviatilis (BOLD:AAF2344). Hybridisation among closely related taxa, incomplete lineage classi cation of a recent, on-going speciation event, or intraspeci c morphological variability could account for this observation 50 .
Similarly, 51 who studied the sh diversity in the Parnaíba Delta, opined that a lack of taxonomy consensus may be observed when the rate of molecular variation does not accompany recent sympatric speciation event that lead to morphological differentiation.

Establishment of DNA barcoding library
Precise identi cation of organisms is a prerequisite for assessing the biological and ecological status of an aquatic ecosystem. The current study illustrates yet another example of the complementarity of the morphological and molecular techniques to achieve this goal. DNA barcoding offers a non-invasive approach in aquatic diversity assessment and requires minimal expertise in conventional taxonomy 52,53 . Comprehensive DNA barcode reference library is crucial in any biodiversity assessment for providing selective autecological and biogeographic information for comparative analysis with previous assessment. Even though DNA databases like BOLD 54 and GenBank 55 are publicly available, a localised taxon-speci c reference library is synoptically important as it is easier to curate and is a more practical reference for a focused site. The DNA barcodes reference library associated with voucher collections established in our study broadens the research spectrum in biological evaluation and biomonitoring effort. Future research endeavours will utilise this database to assess this habitat's ecological and health status through eDNA metabarcoding. The barcode data generated in this study will de nitely contribute to the local as well as regional conservation efforts of sh diversity. Notwithstanding, to improve the resolution of the taxonomic coverage, the number of DNA barcodes for the singleton specimens should be increased through more intensive sampling and increased number of sites within the estuary.
Of the 135 species examined in this study, 61 species (~45%) were identi ed with high commercial value 56 thus, protection planning, and proper shery management of these species are vital. Furthermore, we manage to barcode an invasive species -the Mozambique tilapia, Oreochromis mossambicus that were sampled within our studied site.
We DNA barcoded a rich and diverse mangrove-related sh assemblage. Of the 135 species initially identi ed based on morphology, barcodes of 131 species support their validity. We found hidden diversity within six species whereas the divergences between two pairs of valid species are below the interspeci c threshold standard raising questions on their validity and calling for further taxonomic studies. Although certainly not complete, the establishment of a local DNA barcodes reference library is an essential step for future studies of sheries, conservation and ecological management of this important site.

Ethics statement
This project was conducted according to the relevant national and international guidelines and did not involve any endangered or protected sh species. All sh specimens were either collected from the local shermen, caught using non-invasive shing gear by the authors, or bought from the local market. This study was carried out following the recommendations and approval by the Universiti Sains Malaysia Animal Ethics Committee.

Sample collection
A total of 441 specimens were sampled between December 2018 to October 2019 at multiple locations along the Merbok Estuary and its vicinity (Figure 1). Specimens were collected either from local shermen (who use the barrier-net method locally called 'pompang'), direct sampling by dip-net or bought from the major sh landing site (Kuala Muda Whispering Market). All specimens were caught within Merbok River and its adjacent waters. Samples collected from the sh landing site were retrieved from shing vessels that operate within Zone A (from the shoreline up to 5 nautical miles) and Zone B (from 5 to 12 nautical miles) 57 . Information on the sampling localities (geographical coordinates) is shown in Table S1. Other collection data -dates, taxonomy and details of voucher specimens can be retrieved from the online project datasheet implemented in BOLD with project code -DBMR.
Sample processing and morphological identi cation A n clip from each fresh specimen was taken and stored in 90% ethanol. Voucher specimens were xed in 10% formalin for at least one week and then transferred into 70% ethanol for long term storage. All specimens were catalogued and deposited at the Museum of Biodiversity, Universiti Sains Malaysia.
Morphology-based species identi cations and nomenclature follow 15 . We were unable to unequivocally assigned few specimens to a valid described species using available keys. In these cases, we used either "sp.", "cf.", or "aff.".

Laboratory analyses
Genomic DNA was extracted using DNeasy Blood & Tissue kit (Qiagen, Germany) following the given protocol of animal tissue DNA extraction. The purity and concentration of the isolated DNA were measured using a microvolume UV spectrophotometer (Quawell Q300, Quawell, CA) and stored at -20°C until further use. An approximately 650 bp fragment of the mitochondrial COI gene region was ampli ed using the combinations of the following primers previously designed by 22 : Each sample was ampli ed in a nal volume of 25 µL, containing 5.5 µL of 5x MyTaq™ Reaction Buffer Red (Bioline GmbH, Germany), 0.5 µL of each primer (100 ng/µL), 0.25 µL 5U Taq polymerase (iNtRON Biotechnology Inc., Korea), 2.5 µL of genomic DNA (50 ng/µL) and adequate nuclease-free water to complete the nal reaction volume. Each ampli cation set was performed with the inclusion of a negative control (no template DNA) with thermal cycling conditions as follows: initial denaturation at 94°C for 4 minutes; followed by 35 cycles of denaturation at 94°C for 30 seconds, annealing at 48°C for 50 seconds, and extension at 72°C for 1 minute; then a nal extension at 72°C for 10 minutes. The PCR products were then fractioned by 2% gel electrophoresis to check for successful ampli cation. All positive ampli cations were then sent for puri cation and sequencing to Apical Scienti c Sdn. Bhd. (Selangor, Malaysia) operating the ABI PRISM 3730XL automated sequencer and the ABI PRISM BigDye terminator cycle sequencing kit v3.1 (Applied Biosystems, Foster City, CA). Bidirectional sequencing was employed to decrease the probability of sequencing errors.

Data analyses
Each generated chromatogram was manually screened prior to DNA alignment in MEGA X 58 . The sequences were proofread and independently aligned and then inspected for deletions, insertions and stop codons using the same software. All sequences have been uploaded in BOLD 54 and deposited in GenBank 55 (Accession nos. MW498499 -MW498843).
A total of 350 COI sequences were determined in this study. To assess the taxon discrimination between all specimens, pairwise genetic distances were calculated within and between species, genera, and families based on the Kimura 2-parameter (K2P) distance model 59  within the BOLD platform using the RESL algorithm 65 to assign sequences to a dedicated Barcode Index Numbers (BIN). Next, the ABGD 39 analysis was run at the webserver (https://bioinfo.mnhn.fr/abi/public/abgd/abgdweb.html) to census divergence within the analysed dataset for species delimitation. The ABGD analysis was run with the following settings: relative gap width X=1.0, intraspeci c divergence (P) values range from 0.001 to 0.0059 for all the distance metrics, while all other parameter values were kept as default. Finally, the GMYC method 66 was employed with the fully resolved, BI ultrametric COI tree (see above for the reconstruction method). A single-threshold GMYC analysis was run in RStudio 67 with the 'splits' package.