African-specific prostate cancer molecular taxonomy


 Prostate cancer is characterised by significant global disparity; mortality rates in Sub-Saharan Africa are double to quadruple those in Eurasia1. Hypothesising unknown interplay between genetic and non-genetic factors, tumour genome profiling envisages contributing mutational processes2,3. Through whole-genome sequencing of treatment-naïve prostate cancer from 183 ethnically/globally distinct patients (African versus European), we generate the largest cancer genomics resource for Sub-Saharan Africa. Identifying ~2 million somatic variants, Africans carried the greatest burden. We describe a new molecular taxonomy using all mutational types and ethno-geographic identifiers, including Asian. Defined as Global Mutational Subtypes (GMS) A–D, although Africans presented within all subtypes, we found GMS-B to be ‘African-specific’ and GMS-D ‘African-predominant’, including Admixed and European Africans. Conversely, Europeans from Australia, Africa and Brazil predominated within ‘mutationally-quiet’ and ethnically/globally ‘universal’ GMS-A, while European Australians shared a higher mutational burden with Africans in GMS-C. GMS predicts clinical outcomes; reconstructing cancer timelines suggests four evolutionary trajectories with different mutation rates (GMS-A, low 0.968/year versus D, highest 1.315/year). Our data suggest both common genetic factors across extant populations and regional environmental factors contributing to carcinogenesis, analogous to gene-environment interaction defined here as a different effect of an environmental surrounding in persons with different ancestries or vice versa. We anticipate GMS acting as a proxy to intrinsic and extrinsic mutational processes in cancers, promoting global inclusion in landmark studies.


Patient cohort
Our patient cohort was comprised of 183 patients from Australia (n=53), Brazil (n=7) and South Africa (n=123) and presenting mostly with clinicopathologically confirmed prostate cancer. All except one Australian patient (PID 15178) treated with one-month-long Ozurdex therapy were treatment naïve at time of sampling. Three patients were unconfirmed for the cancer and confirmed for benign prostate hyperplasia (BPH). All men from the Southern African Prostate Cancer Study (SAPCS) were recruited at the time of diagnosis, and therefore tumour tissue was derived from biopsy core, while age and PSA levels were recorded at the time of diagnosis. Australian and Brazilian subjects were recruited at the time of radical prostatectomy. Their ages and PSA levels were also recorded at the same time. Additional selection criteria included: availability of fresh-frozen tissue and matched blood, self-reported ethnicity and country of origin, as well as availability of clinical and pathological data (Supplementary Table 1).

Ethics
All samples were obtained with written informed consent, as per study approval  Table 1).

Hayes, Rosemarie Sadsad
The Sydney Informatics Hub (SIH), Core Research Facilities, University of Sydney developed the whole-genome sequencing analysis pipeline used in this study and optimised these pipelines for the University of Sydney's High Performance Computing cluster, Artemis, and Australia's National Computational Infrastructure (NCI), Gadi High Performance Computing facility.

Quality control
QC-tools can be used to quality check raw sequencing files. Supplementary Table 1 shows MultiQC reports of raw sequencing reads (FASTQ format) from Kinghorn Centre for Clinical Genomics (KCCG), Garvan Institute of Medical Research.

Variant discovery pipelines
6 Whole-genome sequencing data from 190 patients admitted in prostate cancer clinics (380 tumour and blood samples) were analysed at scale using four key pipelines ( Supplementary Figures 1 and 2): i) data pre-processing for variant discovery, ii) germline short variant discovery, iii) somatic short variant discovery, and vi) structural variant discovery. The pipelines used either physical data chunking (Pipeline 1) or genomic interval chunking (Pipelines 1, 2, and 3) to divide the data for massively parallel processing. For jobs (Pipeline 4) where physical or interval chunking were not biologically valid, we implemented a parallel-by-sample approach. Key algorithms that consumed the most compute resources will be discussed below.

Pipeline 1: Data pre-processing for variant discovery
Data pre-processing for variant discovery is executed using the Fastq-to-BAM v2.0 pipeline 1 . Each stage of this pipeline is described in detail here. To prepare raw reads from the KCCG sequencing centre for this data pre-processing step, FASTQ files were separated by sequencing lane, using fastqsplit 9 (https://github.com/supernifty/fastqsplit). Reads were adapter-trimmed and filtered using TrimGalore v0.6.5 to remove low-quality bases (<Q15), short reads (<70bp), and missing read pairs. We also removed sequins, the synthetic DNA spike-in controls added during sequencing with Anaquin v3.9.0 2 .
Each lane of filtered reads was aligned against human reference hg38 + alternate contigs using bwa v0.7.15 3 . BWAkit and BWA-mem functions were used concurrently to improve mapping quality scores for the primary human reference genome and a list of alternative haplotypes, and therefore enable variant calling. Each read pair was aligned as an independent entity, where we parallelised this job by first splitting the FASTQ files into smaller files of 500,000 reads (Job 1.1). These read data are homogenous in size and each ran independently (Supplementary Figure 3).
The scattered alignment tasks were merged with multi-threaded Sambamba v0.7.0 to produce one BAM per sample containing information about reads mapping to the human reference. For analysis-ready BAM used in the following pipelines, duplicate reads (technical artefacts from the sequencing process) were marked using SAMBLASTER v0.1.24, and systematic error correction on the quality scores of DNA sequencing was performed using Base Quality Score Recalibration, GATK v4.1.2.0 (BQSR; Job 1.6).
As per a scatter-gather method, recalibrated tables of each sample were generated on 3,367 contigs (3,366 natural hg38 contigs and one unmapped group). The final job merged the scattered, recalibrated interval BAM files into an analysis-ready BAM file per sample with the Sambamba program.
For the last part of Pipeline 1, contaminated and mislabelled samples were estimated using qSignature v0.1 (Job 1.11). In this study, six tumour samples were removed for their comparison scores greater than 0.2 and two patients were duplicated based on the average Euclidean distances of 0.025-0.038 (qSignature distances) (Supplementary

Pipeline 2: Germline short variant discovery
Germline short variant discovery is executed using the Germline-ShortV v1.0
Each stage of this pipeline is described in detail here. According to somatic short variant discovery (SNV and Indel) best practices by the BROAD Institute (GATK v4.1.2.0), the analysis-ready BAM files described in Section 4. The variants were filtered out for misalignment, strand and orientation bias, polymerase slippage, germline variation, and contamination. The GATK LearnReadOrientationModel read the f1r2 files generated from Mutect2 on interval to identify and filter out erroneous variants with higher frequency in one read pair orientation. The contamination was estimated with GATK CalculateContamination (Job 3.11), using the outputs of GATK GetPileupSummaries (Job 3.19), which summarised information of reads supporting known variants. The final step of filtering took account of all the information and evaluation from Job 3.8-3.11 using GATK FilterMutectCalls (Job 3.12).

Pipeline 4: Structural variant discovery
Somatic copy number alterations for each tumour were identified using CNVkit v0. Pipeline 4 was searched against the whole genome at once for genomic rearrangements observed across different chromosomes. As it was not biologically valid for scatter-gather parallelism, we performed multithreading options available in our pipeline tools and applied a parallel-by-sample approach.

13
Computation for the study was performed on three High-Performance Computing (HPC) systems (Table 1). All massively parallelised computation was performed using the National Computational Infrastructure (NCI) Gadi.  Chromoplexic chains were plotted using the Circos software (http://circos.ca).

Recurrent structural variation (SV) breakpoints
Recurrence analysis of SV breakpoints among 183 tumours was performed using fishHook v0. In addition, recurrent somatic juxtapositions of SV breakpoints in this study (2-dimensional connections between distinct genomic loci) were investigated using Ginseng 13 (https://github.com/walaj/ginseng). No significance was found.

PCAWG somatic drivers
Known driver genes in coding and noncoding regions published in PCAWG 15,23,24 were explored in our 183 tumours. Those specific to prostate cancer genes were also included 21,25-28 ; 70 more driver genes were collected. Among those 1,730 drivers, 649 were found in our cohort across the 11 genomic elements described in Section 7.1.
Moreover, significantly recurrent breakpoints and juxtapositions spanning within 311 genes reported in PCAWG were searched in this study 15 . Visualising the top 300 cancer genes significantly mutated in Sections 7.1-7.4 was carried out using maftools v2.2.10 29 .

Tumours with no apparent drivers
All our 183 prostate tumours have recurrent alterations, regardless of mutational types (Extended Data Fig. 1c). This might be the result of our focus on African samples with high-risk prostate cancer. About 53 patients did not have PCAWG coding drivers observed (derived from hg19; Extended Data Fig. 2a), although three of them had the drivers if using hg38 annotation data (Supplementary Table 2). Nine patients in our cohort showed only recurrent CNAs, without any point mutations, indels and SV breakpoints detected.

Integrative clustering analysis
For a prostate cancer taxonomy purpose, integrative clustering using iClusterPlus 26,30 was computed based on whole-genome information of 183 patients, including simple somatic mutations, SV breakpoints and somatic copy number alterations. We considered binary features of significant and known driver genes based on SSM and SV data to indicate the presence or absence of a driver in each sample (Section 7) and generated cohort-wise segmentation data for CNA data (10-kb binning; Job 4.1). The SSM and SV data were normalised by gene; the CNA data chosen within non-redundant regions (epsilon=0.005; rmSmallseg=TRUE) were set with an adaptive dimension reduction. All of the data types were integrated into a single feature matrix for 183 patients. Bayesian information criteria (BIC) were selected for the best sparse model in our integrative analysis for molecular classification. Following the TCGA 26 , molecular taxonomy comprising seven groups of oncogenic drivers was also compared in this cohort.
Individual consensus clustering of each whole-genome data across different tumours was performed, using ConsensusClusterPlus v1.50.0 31 in R. It evaluated a maximum of 20 clusters, with 1,000 iterations of hierarchical clustering and 80% subsampling.
Euclidean distance was used with Ward's method for hierarchical clustering. The SSM and SV normalised data described above were run, and segmented copy number data from tumours for the clustering were converted to a data matrix of overlapping chromosomal regions comparing all the possible sample pairs using CNTools v1.42.0 in R.

Statistical significance of prostate cancer subtypes
Statistical associations among diverse data types of recurrent alterations in this study ( For multiple-testing bias, the P-value was adjusted for a false discovery rate (FDR) using the Benjamini-Hochberg correction (BH).

20
In addition, gene-centric integration of significantly mutated genes observed above at any test was collated across SSM, SV and CNA alterations and verified its association with our prostate cancer subtypes. The integrated data applied GISTIC results of log2 copy number per gene and sample that were then adjusted to either -0.20 or +0.20 if additional driver genes and recurrent SV breakpoints were present in a sample. The adjustment also considered regular copy number changes per gene and/or prostate cancer subtype if uncertain. The copy number adjustment of the integrated data at -0.10 or +0.10 was also tested for an association with prostate cancer subtypes for a comparative purpose. The adjusted log2 copy number was treated as a dependent variable for the ANOVA analysis. Note that the threshold at either ±0.1 or ±0.2 was identical for the results of genes preferentially mutated in specific tumour subtypes, except for one fewer gene for the latter.

Pathway and network analysis
The genes preferentially mutated in specific tumour subtypes mentioned above were used for the discovery of enriched pathways using ActivePathways v1.0. Reactome databases.

High-risk CPGEA
To compare molecular subtypes within Asian prostate cancer, Chinese Prostate Cancer Genome and Epigenome Atlas (CPGEA, PRJCA001124), which is the largest and most 21 comprehensive cancer genomics study conducted in China 35 , was merged and processed with our integrative clustering analysis (Section 8.1). SSMs and somatic SV breakpoints from 93 high-risk prostate tumours were included and overlapped with our recurrent alterations (Section 7). Instead, GISTIC results for all data in log2 copy number (Section 7.2) from both high-risk CPGEA and this study were merged across 25,988 hg38-annotated genes and integrated with other data types for this analysis.
Percent genome alteration (PGA) was calculated based on the total length of genes defined by the RefSeq database (UCSC GRCh38/hg38, Dec. 2013) and a cut-off at ±0.2.
Gene-centric normalisation was performed in all the three datasets, and unsupervised hierarchical clustering was also run in each of them, using ConsensusClusterPlus v1.50.0 31 in R.

PCAWG
We leveraged the Pan-Cancer Analysis of Whole Genomes (PCAWG) to test tumour subtypes across different ethnic groups in other cancer types, using their SSM and SV consensus callsets and the GISTIC results for all data by gene in log2 copy number were also generated to confirm the lowest approximation error across multiple runs.

SV signatures
The As per the above-mentioned classification, a feature matrix of counts per patient (across 183 patients) of SV clusters falling into different features of SVs split by size and/or replication timing was created and used for the NMF analysis described above to 24 estimate an optimal factorisation rank and, therefore, the number of SV signatures in this cohort.

Statistical analysis of mutational signatures
The

Mutation timing
To improve a cancer timeline estimated above by variant allele frequency (Section

League model relative ordering
League model relative ordering by aggregating the order of driver genes and CNAs across samples to define a probabilistic ranking of the drivers was performed using PhylogicNDT (https://github.com/broadinstitute/PhylogicNDT). Due to different data types the program required at the time of analysis, we skipped the Clustering and SinglePatientTiming modules in the program and created a single aggregated table described for the LeagueModel module. Orderings of each pair of driver genes and CNAs (early, late, unspecified clones and subclones) were derived from the timing of each driver mutation, as well as from the timing status of clonal and subclonal copy number segments retrieved from MutationTimeR. The table of those orderings was aggregated across all samples with the drivers available (n=132/183). The probability of the first event in a pair occurring before the second one would be either '0' or '1'; otherwise, both events were '0' if unknown was '1'. Sports statistics in the League model were employed to calculate the overall ranking of driver events, with the following parameters: n_perms=1000, n_seasons=1000, percent_subset=0.90, and num_games_against_each_opponent=2.

Reconstruction of prostate cancer timelines
Results of the MutationTimeR described in Section 10.2, which gathered per-sample information of both VAF of somatic mutations and marked copy number gains and classified them into four stages (early clonal, unspecified clonal, late clonal and subclonal), were combined across samples in the same tumour subtype to reconstruct a timeline of cancer evolution. We included only driver mutations identified in this study (Sections 7.1 and 7.4) and recurrent copy number overlapped with GISTIC results (Section 7.2), genes preferentially mutated in specific tumour subtypes (Section 8.2) and/or relevant copy number altered within the study by Wedge, et al 28 .
The copy number was overlapped for both alteration size (>50%) and type. The driver genes and copy number reported were present within at least two tumours, with the same timing annotation. Somatic mutations among the four timing periods of the 27 MutationTimeR in our 183 tumours were then profiled for their mutational signatures and compositions at a time, using SigProfiler (Section 9.1).