Taxon sampling
One hundred sixty-seven individuals (see Supplementary Table S1 online) including 114 cultivated C. tinctorius and 53 individuals of wild species of Carthamus, i.e. C. lanatus (18), C. oxyacantha (8), C. glaucus (6), C. palaestinus (7), C. tenuis (6), and C. boissieri (8), were included in GBS-based phylogenetic analyses. A set of eight diverse individuals of C. tinctorius was used together with individuals of C. palaestinus, C. oxyacantha and C. glaucus for population structure analyses.
The analyzed materials were obtained as seeds from the gene bank collections of RTIPP-SBUK (Research and Technology Institute of Plant Production, Shahid-Bahonar University, Kerman, Iran), IPK (Institute of Plant Genetics and Crop Plant Research, Gatersleben, Germany) and NPGS (National Plant Germplasm System, U.S. Department of Agriculture, USA) and grown at least to a four-leaf stage before harvesting leaves for DNA extraction. Detailed information such as passport data and observation records for each accession can be found in the online GRIN (Germplasm Resources Information Network) database of NPGS (http://www.ars-grin.gov/npgs/) and Genebank Information System of the IPK Gatersleben (https://gbis.ipk-gatersleben.de/). Collection and use of plant materials was conducted in accordance with institutional, national and international regulations. Seed exchange followed the Nagoya rules of access and benefit sharing through standard material transfer agreements.
DNA extraction and genotyping-by-sequencing (GBS)
Genomic DNA was extracted from young leaves with a DNeasy Plant kit (QIAGEN). To obtain genome-wide SNPs, we used a two-enzyme GBS protocol that uses the restriction enzymes PstI (NEB Inc.) and MspI (NEB Inc.) to digest 200 ng of genomic DNA. For the library preparation and individual barcoding, the protocol of Wendler et al. [36] was followed. The individual libraries were sequenced in multiplex on an Illumina HiSeq 2500 (100 bp single-end reads) at the sequencing facilities of Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), Germany.
GBS data preparation and filtering
Quality assessment of all raw sequence samples was performed using fastqc (http://www.bioinformatics.babraham.ac.uk/projects/fastqc). Sequences were de-multiplexed allowing for one base mismatch in barcodes, and the restriction site and barcode were trimmed from all GBS sequence reads using cutadapt [37]. Sequence data were filtered and clustered using the ipyrad bioinformatics pipeline [38]. We conducted two runs of ipyrad: (i) one with a small set of C. tinctorius accession but including seven wild Carthamus species, and (ii) using 114 individuals from C. tinctorius plus its closest three relatives according to the prior analysis. The minimal number of samples to possess a certain locus was set to 90% of the individuals. The clustering threshold of reads within and between individuals was set to 0.85 and 0.90 in the first and second dataset, respectively. For the other parameters the default settings were used. The assembly of the GBS data was done de novo.
Phylogenomic analyses
After removing samples with a too low number of reads, we used MrBayes 3.2.6 [39] and Paup* 4a169 [40] to infer phylogenetic relationships within our set of Carthamus species and accessions. NJ, based on GTR distances, and MP trees were calculated in Paup* to analyze the aligned sequence data matrices. In MP gaps were treated as missing data and we used a heuristic search with TBR branch swapping with initially 250 random addition sequences (RAS) restricting the number of collected trees to 50 per replicate. The resulting shortest trees were then used as starting trees in a second heuristic search without setting a limit on tree numbers. Bootstrap support values were obtained by 500 bootstrap re-samples using the same settings as before except that we excluded RAS. Paup* was also used to infer the model of sequence evolution.
For BI two times four chains were run for 5 million generations for the dataset including the set of different Carthamus species, specifying the model of sequence evolution as GTR+Γ. We sampled a tree every 1000 generations and summarized the trees in MrBayes. Converging log-likelihoods, potential scale reduction factors for each parameter and inspection of tabulated model parameters in MrBayes suggested that stationary had been reached. The first 25% of trees were discarded as burn-in. For the large dataset including all C. tinctorius accessions, phylogenetic inference calculations in MrBayes were run for more than three weeks without converging chains when we cancelled this analysis.
Population structure analysis
The dataset of SNPs generated from ipyrad was used as input for the analysis of population structure for C. oxyacantha, C. palaestinus, C. glaucus and eight C. tinctorious cultivars in the R package LEA [41]. The number of subpopulations (K) was assumed to range from one to seven. An admixture analysis [42] using sparse nonnegative matrix factorization (snmf) was used for the estimation of population genetic structure. In addition, a Principal Components Analysis (PCA) was conducted in ipyrad on the same dataset.
Analysis of genome size
To obtain ploidy levels we used flow cytometry to measure genome sizes of Carthamus species [43]. For this, we collected a leaf from each of three individuals from eight C. tinctorius and four C. oxyacantha accessions and three leaves from one C. palaestinus individual. Afterwards the leaves were transported to the lab and genome sizes were measured on a CyFlow flow cytometer (Sysmex Partec GmbH) using propidium iodide (PI) as staining reagent (Sigma-Aldrich) according to Jakob et al. [44]. The tomato cultivar ‘Stupicke’ was used as the size standard (2C DNA content = 1.96 pg [45]).