1. Implementation
1.1. Using Conda, Snakemake, and GitHub to ensure portability, scalability, and reproducibility
Workflow managers make it easy to leverage the computational power of HPC clusters and cloud environments by automating the parallelization process (scalability). By separating environment configuration from the bioinformatics analysis per se, workflow managers ease switching from one HPC environment to another (portability), while ensuring that the analysis can be rerun using the exact same tools (reproducibility). Using a workflow manager therefore has many advantages over writing workflows from scratch in generic scripting languages such as bash, Perl, or Python. Several workflow managers, each with their pros and cons, are available (for a recent overview, interested readers may refer to [20]). Here we chose to rely on Snakemake to develop our GeCKO workflows [21, 22].
Snakemake, with over a decade of development, is widely adopted in the bioinformatics community. It facilitates workflow modularity and the easy creation of custom environments equipped with the necessary bioinformatics tools through Mamba [23], a faster and more efficient drop-in replacement for the Conda package manager [24]. Additionally, Snakemake is built on Python, a language widely used in science and particularly user-friendly and adaptable for population geneticists.
While Snakemake also supports Singularity and Docker containers for creating controlled environments, using Mamba proved simpler and more flexible in our case. Snakemake's ability to automate Conda environment creation circumvents the need for container recipes and the challenges associated with container use on HPC clusters, such as non-standard directory structures.
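To illustrate this integration, below is a minimal, purely illustrative Snakemake rule (not taken from GeCKO; the tool and file paths are hypothetical) showing how the conda directive ties a rule to an environment file that Snakemake builds automatically with Mamba:

rule fastqc:
    # Hypothetical quality-control rule; paths are illustrative
    input:
        "RAW_DATA/{sample}.fastq.gz"
    output:
        "QC/{sample}_fastqc.html"
    conda:
        "ENV/conda_tools.yml"    # environment created by Snakemake via Mamba
    shell:
        "fastqc {input} --outdir QC"

When Snakemake is invoked with its --use-conda option, the environment described in the YAML file is created once on first use and then reused, guaranteeing the same tool versions across runs.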
For reproducibility and traceability purposes, our workflows automatically document essential information at each run. This includes the run's date and time, the configuration files used, the GitHub commit ID of the current workflow version, and the creation timestamps of output files, all logged in a workflow_info.txt file. Additionally, if the workflow is rerun (e.g., to generate missing files), the information from the latest run is appended to the end of this file. Users can thus trace all runs that may have contributed to the generation of any output file, enhancing the traceability and transparency of the research process.
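While the exact layout is best checked in the file itself, a workflow_info.txt entry records information along these lines (all values below are invented for illustration):

Run date: 2023-06-12 14:35:02
Config file: CONFIG/config_DataCleaning.yml
GitHub commit ID: 3f2a1c9
Output files and creation timestamps: ...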
1.2. Using MultiQC to produce informative task reports
Genome-scale data analysis for genotyping numerous individuals generates hundreds of files at each step, making it challenging to monitor the overall process. Although workflow managers aid in parallelizing tasks, detecting issues early in the process is often difficult, with problems typically becoming apparent only after the workflow's completion. Traditionally, intermediary quality checks depend on tedious, ad-hoc scripts. MultiQC [25] addresses this by automating the generation of visual HTML reports. It parses output files from various bioinformatics tools, creating informative summaries that are then integrated into a comprehensive report. This report tracks individual sample names and highlights potential issues, such as individuals with very low read coverage that may require re-sequencing, low mapping percentages indicative of potential inter-species contamination, loci not correctly captured by the designed baits, or a drastic drop in the number of SNPs after certain filtering steps. In the GeCKO workflows, MultiQC's capabilities are leveraged to produce detailed reports, enabling prompt identification of these issues.
1.3. A homogeneous architecture for the four workflows of GeCKO
GeCKO comprises four main workflows: DataCleaning, ReadMapping, VariantCalling, and VcfFiltering. Each workflow is configured through three distinct files: one specifying software release versions, another for cluster-related details (such as partition or queue names, memory, and CPU requirements), and a third for workflow-specific parameters (such as input file paths and bioinformatics tool options). The first file (“conda_tools.yml”) can be found in each GeCKO workflow’s WORKFLOW/ENV subfolder; it lists the software releases to be used, as well as their dependencies, which Mamba then installs. The second file is a cluster configuration file and needs to be modified to fit the user’s data and HPC cluster environment. GeCKO offers configuration examples for both Slurm [26] and SGE [27] job schedulers, allowing users to specify partitions (for Slurm) or queues (for SGE) and adjust memory requirements for individual tasks. Slurm and SGE templates for this file are provided with each GeCKO workflow’s example dataset (e.g., “DC_CLUSTER_PROFILE_SLURM/config.yaml”).
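As a minimal sketch of what such a Slurm profile’s config.yaml might contain (assuming Snakemake’s cluster-profile mechanism; the partition name and memory values below are hypothetical, and GeCKO’s actual templates may differ):

jobs: 100                    # maximum number of jobs submitted concurrently
cluster: "sbatch --partition={resources.partition} --mem={resources.mem_mb} --cpus-per-task={threads}"
default-resources:
  - partition=normal         # hypothetical Slurm partition name
  - mem_mb=4000              # default memory per task, in MB

Each key in this file maps to a Snakemake command-line option (--jobs, --cluster, --default-resources), which Snakemake reads from the profile directory passed at launch.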
The final configuration file specifies parameters directly related to data analysis, such as input file paths and options for the bioinformatics tools used in the workflow. For each workflow, GeCKO provides at least one template configuration file (e.g., “config_DataCleaning.yml”), which users can easily customize by replacing default values with their specific parameters. Detailed information about the available options can be found in the GeCKO documentation on GitHub. The configuration files use the YAML format, which offers a user-friendly way to assign values to parameters. An example excerpt from the “config_DataCleaning.yml” file might include:
### GENERAL VARIABLES ###
PAIRED_END: TRUE
### INPUT FASTQ FILES ###
FASTQ_R1: "RAW_DATA/R1.fastq.gz"
FASTQ_R2: "RAW_DATA/R2.fastq.gz"
### TRIMMING PARAMETERS ###
TRIMMING_QUAL: 30
TRIMMING_MIN_LENGTH: 36
The output files of every GeCKO workflow are grouped in a folder named after the selected workflow (e.g., “DATA_CLEANING”), created at the root of the general output folder “WORKFLOWS_OUTPUTS”. When default paths and file names are used, GeCKO workflows can automatically locate the files they need from previous analysis steps, alleviating the need to specify each file path manually (though this option remains available).
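For instance, after running all four workflows, the output tree would be organized along these lines (subfolder names other than DATA_CLEANING are inferred here from the workflow names and may differ slightly):

WORKFLOWS_OUTPUTS/
├── DATA_CLEANING/
├── READ_MAPPING/
├── VARIANT_CALLING/
└── VCF_FILTERING/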
1.4. Installing and executing GeCKO workflows
To install GeCKO, one simply needs to clone the GitHub project (https://github.com/GE2POP/GeCKO) onto a computer or HPC cluster with Git and Conda (release 22.9.0 or higher) available.
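Cloning follows the usual Git procedure, for example:

git clone https://github.com/GE2POP/GeCKO.git
cd GeCKO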
Detailed installation instructions are provided in the GitHub repository's readme file, which also includes sample datasets and configuration templates for easy setup. A convenient launcher script, runGeCKO.sh, is also provided. To help prevent potential compatibility issues, it is recommended to use the launcher to create a Conda environment that includes Snakemake (currently version 7.32.4), Mamba (currently version 1.4.9), and their dependencies with:
./runGeCKO.sh --create-gecko-env
To execute a workflow, the Conda environment should first be activated with:
conda activate GeCKO_env
Then, the launcher can be invoked, using the “--workflow” argument to specify the workflow’s name (i.e., DataCleaning, ReadMapping, VariantCalling, or VcfFiltering). By default, the script looks for the workflow’s configuration file (e.g., config_DataCleaning.yml for DataCleaning) in a CONFIG folder, but a different file and path can be specified with the “--config-file” argument. The cluster configuration file is passed with the “--cluster-profile” argument, by providing the path to the directory containing the config.yaml file (e.g., “DC_CLUSTER_PROFILE_SLURM”). The script not only executes workflows but also exposes Snakemake features for testing workflow consistency (--dryrun), generating usage reports including CPU usage per task (--report), or creating workflow diagrams (--diagram). To execute the DataCleaning workflow on a Slurm-based HPC environment with up to 100 concurrent jobs, the command would be:
./runGeCKO.sh \
--workflow DataCleaning \
--cluster-profile CONFIG/DC_CLUSTER_PROFILE_SLURM \
--jobs 100
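Before launching the actual run, the same command can be checked with a dry run, which lists the jobs Snakemake would schedule without executing them:

./runGeCKO.sh --workflow DataCleaning --cluster-profile CONFIG/DC_CLUSTER_PROFILE_SLURM --dryrun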
2. Use case
In this section, we detail the methodology used to acquire sequence data, forming a dataset suitable for our use case. The subsequent analysis of this dataset using the GeCKO workflows is described in the Results section, illustrating its practical application.
2.1. Biological material
In this study, we focus on 120 accessions from the three major subspecies involved in the four transition steps of durum wheat domestication (Figure 1): the wild form T. turgidum dicoccoides (n=30, denoted DD), the first domesticated form with a solid rachis, T. turgidum dicoccum (n=30, DC), and the non-hulled cultivated form, T. turgidum durum (n=60). The fourth step of domestication occurred within this latter subspecies, which is further divided into two groups depending on whether the varieties originated before or after the Green Revolution. The first group consists of "landraces" (n=30, DP), i.e., lines derived from local varieties, and the second of "elite" varieties (n=30, DE) registered in Europe after the Green Revolution (1970-1990). These 120 accessions were selected from a 314-accession collection [17] to maximize the genetic diversity within each group, using the MSTRAT software [28].
2.2. From biological samples to raw reads
For each of the 120 genotypes, genomic DNA was extracted from 50 mg of plant material. After grinding in liquid nitrogen and cell lysis (extraction buffer with SDS and CTAB), the DNA was purified on a KingFisher™ Flex Purification System (Perkin-Elmer Chemagen metal and magnetic beads). The quality of the extracted DNA was assessed through 1.5% agarose gel electrophoresis and a spectrophotometric assay (absorbance at 260/280 nm and 260/230 nm – SPARK10M™ TECAN), while DNA quantity was estimated using a spectrofluorometric assay (Hoechst 33258 – InfiniteM200™ TECAN). Genomic libraries were prepared following the protocol presented by Holtz et al. [13], specifically tailored for subsequent target enrichment capture, with two key modifications: i) the ligation of barcodes to both ends of the DNA fragments to eliminate chimeric molecules formed during PCR; and ii) the multiplexing of libraries prior to targeted enrichment, ensuring a uniform number of reads per genotype through precise quantification.
Targeted enrichment capture makes it possible to selectively isolate and sequence specific regions of interest from a genome, using probes (baits) that hybridize to those regions. In our case, we used the 20,000 baits designed by Holtz et al., targeting 6,240 SNPs within coding regions across the durum wheat genome. These baits, each 120 nt long, were synthesized by Arbor Biosciences (https://www.arborbiosci.com/). Following Holtz et al.’s protocol, target enrichment capture was conducted using the SeqCap EZ Hyb and Wash kit (Roche). Given the complexity of the durum wheat genome and the minuscule proportion represented by the targeted regions (about 0.1% of the genome), we optimized the capture process by: i) carrying out two successive capture phases to enhance hybridization specificity and, consequently, enrichment efficiency; and ii) using blockers to saturate highly repetitive genomic regions and reduce their interference.
Finally, the captured sequences were amplified and quantified before undergoing paired-end sequencing (2x150 bp) on an Illumina HiSeq 3000™ high-throughput sequencer. To ensure sufficient coverage of the targeted loci, the libraries from the 120 genotypes were divided into two pools, captured and sequenced independently: DEV_Cap009 (62 genotypes) and DEV_Cap010 (58 genotypes). The sequencing data from each capture experiment were delivered as two compressed fastq files (paired R1 and R2). This constitutes the dataset that will be processed using the GeCKO workflows, as detailed in the subsequent sections.