The use cases demonstrated below include demo input data in the template workspaces, so the R scripts below can run the listed use cases from the local computer. Ready-to-run examples that can be used to test the process on the user’s own AnVIL account are available in the AnVILWorkflow package vignette. GATK best practice pipelines[14] are not demonstrated here, but they are also available as AnVIL workspaces.
## Load the package
library(AnVILWorkflow)
## Set up the account
setCloudEnv(accountEmail = {AnVIL account email},
            billingProjectName = {AnVIL billing project name})
## Clone the workspace of interest
templateName <- {Name of the template workspace to clone}
newName <- {Unique name for your copy of the workspace}
cloneWorkspace(workspaceName = newName, templateName = templateName)
## Run the workflow
runWorkflow(workspaceName = newName,
            workflowName = {Name of the workflow, if the workspace contains more than one})
## Get the workflow outputs
getOutput(workspaceName = newName)
The main features of the demo workspaces and their workflow-specific input data preparation process are described below.
4.1. Bulk RNA sequencing data analysis
The Salmon workflow uses AnVIL’s data model and requires four essential inputs: fastq1, fastq2, fasta, and the transcriptome index name. It can therefore be readily applied to the consortium data hosted in AnVIL, which follow the same data model. With the default runtime environment configured for this workflow (1 CPU, 2GB memory, and 10GB SSD disk), processing 16 demo samples (32 fastq files, ~1GB per file) took about 30 minutes and cost $0.12.
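The reported demo figures can be turned into a rough per-sample cost. The sketch below is only an illustration: the helper `estimate_cost` is hypothetical, and the linear scaling it applies is an assumption, not a measured result.

```r
## Reported demo run: 16 samples, about 30 minutes, $0.12 in total
demo_samples <- 16
demo_cost_usd <- 0.12

## Average cost per demo sample in USD
cost_per_sample <- demo_cost_usd / demo_samples

## Hypothetical helper: linear extrapolation from the demo run.
## Real costs depend on file size, runtime settings, and cloud pricing.
estimate_cost <- function(n_samples) {
  n_samples * cost_per_sample
}

print(cost_per_sample)
print(estimate_cost(100))
```

Under this assumption the demo run averages less than one cent per sample, which gives a first-order budget before launching a larger batch.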
4.2. Whole metagenomic shotgun data analysis
bioBakery is a metagenome analysis environment composed of Python-based tools, reference databases, and command-line-based workflows. It processes raw shotgun sequencing data into microbial community feature profiles, summary reports, and figures[9]. bioBakery’s whole metagenome shotgun (wmgx) and visualization (wmgx_vis) workflows are implemented as an AnVIL workspace. The current version of AnVILWorkflow supports bioBakery version 3[15]. While users can customize this workflow to a great degree, six inputs are sufficient to run a standard, optimized version of it. Those six inputs are:
- Name of the Trimmomatic adaptor type (for demo data, NexteraPE)
- Your project name
- Extension of input files (for demo data, .fastq.gz)
- A table of your sequencing file (fastq) names stored in the Google Cloud Storage bucket
- Input file identifiers for paired-end sequencing (for demo data, _R1 and _R2)
The seven required databases are already linked to this workflow, and nine additional optional inputs are available for further customization, such as bypassing functional profiling (default is false) and setting the maximum memory usage for different tasks (defaults are 32GB for functional profiling by HUMAnN, 8GB for quality control by KneadData, and 24GB for taxonomic profiling by MetaPhlAn). This workflow uses call caching and preemptible instances by default for cost efficiency. Processing six paired-end demo samples (mean file size ~380MB) with the optimized default settings, but without preemptible instances, took about 5 hours and cost around $6.50. With preemptible instances, a run can take longer but cost less. Compared to existing options such as Nephele[16], AnVILWorkflow allows a programmatic approach and more flexible customization.
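The file table and paired-end identifiers listed among the inputs above can be prepared locally before they are uploaded. The following is a minimal sketch in base R, assuming hypothetical file names and hypothetical column names (`sample`, `fastq1`, `fastq2`); the exact table layout expected by the workspace is not specified here and should be checked against the workspace documentation.

```r
## Hypothetical file names; replace with the objects in your own bucket
fastq_files <- c("sampleA_R1.fastq.gz", "sampleA_R2.fastq.gz",
                 "sampleB_R1.fastq.gz", "sampleB_R2.fastq.gz")

extension <- ".fastq.gz"   # input file extension (demo data value)
id_pair1 <- "_R1"          # identifier for the first read of each pair
id_pair2 <- "_R2"          # identifier for the second read of each pair

## Strip the extension, then flag files by their read identifier
base <- sub(extension, "", fastq_files, fixed = TRUE)
is_r1 <- endsWith(base, id_pair1)
is_r2 <- endsWith(base, id_pair2)

## Derive sample names by removing the identifier suffix
sample_r1 <- substr(base[is_r1], 1, nchar(base[is_r1]) - nchar(id_pair1))
sample_r2 <- substr(base[is_r2], 1, nchar(base[is_r2]) - nchar(id_pair2))

## Every sample needs exactly one file for each read
stopifnot(setequal(sample_r1, sample_r2),
          !anyDuplicated(sample_r1),
          !anyDuplicated(sample_r2))

## One row per sample, with the second read matched by sample name
file_table <- data.frame(
  sample = sample_r1,
  fastq1 = fastq_files[is_r1],
  fastq2 = fastq_files[is_r2][match(sample_r1, sample_r2)],
  stringsAsFactors = FALSE
)
print(file_table)
```

Checking the pairing locally catches missing or duplicated files before a multi-hour cloud run is started.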
4.3. Histopathology image processing using PathML
We implemented the hematoxylin-eosin (HE) stain normalization process of PathML as an AnVIL workspace. This workflow accepts an SVS file as input and returns the original and normalized images as PNG files. There are two required inputs: the Google Cloud Storage URI where the input SVS image file is stored and the sample name. Processing one publicly available image (CMU-1_Small_Region.svs, 1.8MB)[17] with the default runtime (4 CPU, 16GB memory) took about 8 minutes and cost $0.01. This simple but robust analysis setup can support clinical use cases, such as pathologists who need to process a large number of images in a short time, by offering guidance and cross-validation options.