Beta-binomial model of allele frequency in next-generation sequencing (NGS)
Our method is designed for human applications and assumes a diploid genome. For each locus that contains a single nucleotide variant (SNV) called from NGS data, we define the allele frequency as the number of reads supporting the alternative (non-reference) allele divided by the total read depth. For any diploid genome, if an individual is homozygous for the alternative allele (denoted alternative/alternative, 1/1), the expected allele frequency is 1; if an individual is heterozygous (denoted reference/alternative, 0/1) at a locus, the expected allele frequency is 0.5. These theoretical expectations motivated our use of the binomial distribution for the number of reads at each locus,
\[
P(X = x) = \binom{n}{x} p^{x} (1 - p)^{n - x},
\]
where \(n\) is the total read depth at the locus, \(p\) is the theoretical allele frequency, and \(x\) is the number of reads supporting the alternative allele.
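This binomial model can be evaluated directly; the short Python sketch below (illustrative only, not part of our pipeline) computes the probability mass function for a heterozygous locus.

```python
from math import comb

def binom_pmf(x: int, n: int, p: float) -> float:
    """P(X = x) for X ~ Binomial(n, p): depth n, theoretical allele frequency p."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

# At a heterozygous (0/1) locus with depth 60, the expected allele frequency is 0.5:
prob = binom_pmf(30, 60, 0.5)
```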
While a simple model is intuitively appealing, previous studies have observed extra-binomial dispersion, specifically overdispersion of allele frequency distributions [17–20]. This overdispersion produces higher variability than the binomial distribution allows, so a distribution that accommodates such large variance is needed. Previous studies have demonstrated that the beta-binomial distribution is an appropriate model for allele frequencies at a particular locus in a subpopulation [21, 22]. The beta-binomial distribution is a discrete hierarchical model combining the beta and binomial distributions: the success probability follows the beta distribution, and the response follows the binomial distribution. Hence, the probability mass function of the beta-binomial distribution is
\[
P(X = x) = \binom{n}{x} \frac{B(x + a,\, n - x + b)}{B(a, b)},
\]
where \(n\) is the total number of reads at the locus; \(B(a, b)\) is the beta function with shape parameters \(a\) and \(b\); and \(x\) is the number of reads supporting the alternative allele. This model has been applied in several studies, and its advantages over the binomial distribution when dealing with overdispersion have been repeatedly demonstrated [21, 22]. This prior work motivates our use of the beta-binomial distribution.
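Computing the beta-binomial pmf on the log scale via the log-gamma function avoids overflow at realistic read depths; a minimal sketch (the shape parameters a and b below are illustrative):

```python
from math import comb, exp, lgamma

def log_beta(a: float, b: float) -> float:
    """log B(a, b) computed with log-gamma for numerical stability."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def betabinom_pmf(x: int, n: int, a: float, b: float) -> float:
    """P(X = x) = C(n, x) * B(x + a, n - x + b) / B(a, b)."""
    return comb(n, x) * exp(log_beta(x + a, n - x + b) - log_beta(a, b))
```

With a = b = 5 and n = 60, this distribution has the same mean as Binomial(60, 0.5) but a substantially larger variance, which is precisely the extra-binomial dispersion described above.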
Quality control of variant call format (VCF) files
The input format for our method is the well-established VCF format [23]. To our knowledge, ours is the first method to detect same-species contamination using VCF input. VCF files contain all the SNV information required in the subsequent steps, but quality control is needed to filter out noise and unnecessary information. Because they use different algorithms, different variant calling tools generate different allele frequency patterns; we therefore strongly suggest using the same software for the training and testing data so that the model features are consistent and the classification or regression results are accurate. The recommended quality control and processing steps outlined below are additional steps beyond the processing conducted to produce the VCF files.
Step 1. Insertion/deletion (indel) filtering
Only SNVs, i.e., single-base substitutions, are retained, because only substitution mutations produce the heterozygous and homozygous genotypes that can be appropriately modeled by the beta-binomial distribution. Indels, identified as any variant segment with a length of more than one base pair, are therefore filtered out in this step.
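A minimal sketch of this filter (a hypothetical helper, not the package's actual implementation) keeps only records whose REF and ALT alleles are each a single base:

```python
def is_snv(ref: str, alt: str) -> bool:
    """True for single-base substitutions; False for indels
    (REF or any ALT allele longer than one base pair)."""
    alts = alt.split(",")  # a VCF ALT field may list several comma-separated alleles
    return len(ref) == 1 and all(len(a) == 1 for a in alts)
```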
Step 2. Homozygous and heterozygous genotype calling
The genotypes for modeling are then called, generating new information that summarizes each genotype with respect to the alternative allele. For any SNV, there is a homozygous and a heterozygous genotype for the alternative allele. Genotypes are listed in the GT (genotype) field of the VCF file, where 0/0 is homozygous reference, 0/1 is heterozygous ("Het"), and 1/1 is homozygous alternative ("Hom"). This yields two categories of called variants, each of which corresponds to its own beta-binomial model. Homozygous reference genotypes (0/0) and multi-allelic heterozygous genotypes (1/2, 2/3, and so on) are labeled "Complex" and are not included in further calculations.
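The mapping from GT field to category can be sketched as follows (a hypothetical helper; real VCF parsing must also handle missing genotypes such as ./.):

```python
def label_genotype(gt: str) -> str:
    """Map a VCF GT field to the categories used in this step (illustrative)."""
    alleles = sorted(gt.replace("|", "/").split("/"))  # accept phased (|) and unphased (/)
    if alleles == ["0", "1"]:
        return "Het"
    if alleles == ["1", "1"]:
        return "Hom"
    return "Complex"  # 0/0 and multi-allelic genotypes such as 1/2, 2/3
```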
Step 3. Low- and high-depth filtering
This step distinguishes true calls from sequencing errors by setting thresholds on coverage depth. A reasonable read-depth threshold should be chosen according to the average read depth of the test sample; read depths > 50 provide acceptable sensitivity and specificity for mutation detection [24].
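Assuming a simple interval filter on depth (the upper bound below is an illustrative choice, not a value from the text):

```python
MIN_DEPTH = 50    # threshold supported by [24]; adjust to the sample's average depth
MAX_DEPTH = 1000  # illustrative upper bound for suspiciously deep loci

def passes_depth(depth: int, lo: int = MIN_DEPTH, hi: int = MAX_DEPTH) -> bool:
    """Keep loci whose read depth is plausible; very low depth suggests sequencing error."""
    return lo <= depth <= hi
```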
Step 4. Change-point detection for CNVs
The features of a pure sample with a CNV region resemble those of a region with more than one contributor (i.e., same-species contamination). Hence, CNV regions must be filtered out before generating features. If CNV calls are already available, the function vanquish::defcon() can filter the CNV regions directly. Otherwise, a change-point detection method is used to detect CNV regions. Variances of B-allele frequency (alternative allele frequency) at heterozygous loci have been reported to differ among normal, duplication, deletion, and loss-of-heterozygosity (LOH) regions [25]. Therefore, change-point analysis can be employed to detect change points in variance (i.e., the borders of copy number regions). The changepoint package is applied to heterozygous positions only, to search for multiple change points in variance [26].
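The idea behind variance change-point detection can be sketched as a single-split search minimizing a log-variance cost (a simplification; the actual analysis uses the R changepoint package, which finds multiple change points):

```python
from math import log
from statistics import pvariance

def variance_changepoint(x, min_seg=5):
    """Return the index that best splits x into two segments of differing variance,
    or None if no split improves on the unsplit cost. Single-split sketch only."""
    def cost(seg):
        # likelihood-style cost: n * log(variance), guarded against zero variance
        return len(seg) * log(max(pvariance(seg), 1e-12))
    best, best_k = cost(x), None
    for k in range(min_seg, len(x) - min_seg):
        c = cost(x[:k]) + cost(x[k:])
        if c < best:
            best, best_k = c, k
    return best_k
```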
Distribution- and likelihood-based features
The next step of our approach generates the variables/features used in a model to predict same-species contamination in a sample. Two types of features are generated and used in model building: distribution-based features and likelihood-based features.
Distribution-based features are generated from the allele frequency, a real number between 0 and 1. Allele frequency is categorized into four regions, as shown in Fig. 1: low alternative allele frequency (LowRate), heterozygous alternative allele frequency (HetRate), high alternative allele frequency (HighRate), and homozygous alternative allele frequency (HomRate), with respective lower cut-off values of 0, 0.3, 0.7, and 0.99. Fig. 2 shows the difference between the pure and contaminated curves.
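Under these cut-offs, region assignment reduces to a few comparisons (an illustrative sketch):

```python
def af_region(af: float) -> str:
    """Assign an allele frequency to one of the four regions (cut-offs from the text)."""
    if af < 0.3:
        return "LowRate"
    if af < 0.7:
        return "HetRate"
    if af < 0.99:
        return "HighRate"
    return "HomRate"
```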
Table 1
Classification model features and their descriptions
| Name | Description |
| --- | --- |
| LOH | Het/Hom, the ratio of heterozygous to homozygous markers within a sample |
| HomRate | The percentage of loci in the HomRate region |
| HighRate | The percentage of loci in the HighRate region |
| HetRate | The percentage of loci in the HetRate region |
| LowRate | The percentage of loci in the LowRate region |
| HomVar | The variance of allele frequencies in the HomRate region |
| HetVar | The variance of allele frequencies in the HetRate region |

The model building steps generate the distribution-based features shown in Table 1. These features reflect the distribution of allele frequencies across an entire file rather than at each variant calling position; therefore, each input sample/VCF file is represented by one set of features.
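Given genotype labels and allele frequencies that survived quality control, the distribution-based features could be assembled roughly as follows (a sketch; feature names follow Table 1, but the computational details are illustrative, not the vanquish implementation):

```python
from statistics import pvariance

def af_region(af: float) -> str:
    """Region assignment using the cut-offs from the text."""
    if af < 0.3:
        return "LowRate"
    if af < 0.7:
        return "HetRate"
    if af < 0.99:
        return "HighRate"
    return "HomRate"

def distribution_features(records):
    """records: (genotype_label, allele_frequency) pairs after QC."""
    afs = [af for _, af in records]
    regions = [af_region(af) for af in afs]
    n = len(afs)
    feats = {r: regions.count(r) / n
             for r in ("LowRate", "HetRate", "HighRate", "HomRate")}
    n_het = sum(gt == "Het" for gt, _ in records)
    n_hom = sum(gt == "Hom" for gt, _ in records)
    feats["LOH"] = n_het / n_hom if n_hom else float("inf")
    for var_name, region in (("HetVar", "HetRate"), ("HomVar", "HomRate")):
        vals = [af for af, r in zip(afs, regions) if r == region]
        feats[var_name] = pvariance(vals) if len(vals) > 1 else 0.0
    return feats
```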
The likelihood-based feature is the average log-likelihood over all loci in a VCF file, calculated under the beta-binomial distribution. We used NA10855 and other available pure samples (sequenced at Q2 Solutions) as references to compute the maximum likelihood estimates of the parameters \(p\) and \(\rho\) of the beta-binomial distribution. The log-likelihoods of all loci are then calculated with \(\hat{p}\) and \(\hat{\rho}\), and their average value is taken as the feature.
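One common way to parameterize the beta-binomial by a mean \(p\) and overdispersion \(\rho\) sets \(a = p(1-\rho)/\rho\) and \(b = (1-p)(1-\rho)/\rho\); the sketch below assumes this parameterization (the text does not specify its exact form) to compute the average log-likelihood feature.

```python
from math import comb, lgamma, log

def betabinom_loglik(x: int, n: int, p: float, rho: float) -> float:
    """log P(X = x) under a beta-binomial with mean p and overdispersion rho
    (mean/overdispersion parameterization; an assumption, not taken from the text)."""
    a = p * (1 - rho) / rho
    b = (1 - p) * (1 - rho) / rho
    def lbeta(u, v):
        return lgamma(u) + lgamma(v) - lgamma(u + v)
    return log(comb(n, x)) + lbeta(x + a, n - x + b) - lbeta(a, b)

def average_loglik(loci, p_hat, rho_hat):
    """Mean log-likelihood over all (alt_count, depth) loci: the likelihood-based feature."""
    return sum(betabinom_loglik(x, n, p_hat, rho_hat) for x, n in loci) / len(loci)
```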
Support vector machine model
After generating features, a classification method determines whether a sample is from a single contributor or multiple contributors. Utilizing the e1071 R package [27], we apply an SVM model because of the complexity of pattern recognition within the feature space [28]. The SVM fits a hyperplane between the single- and multiple-contributor regions for optimal classification. Because linear separability is not guaranteed, the Gaussian radial basis function (RBF) kernel is used to avoid parametric assumptions. As part of the SVM analysis, the cost and gamma parameters are tuned with a parallel search: a grid search over exponentially growing sequences of cost and gamma values finds the optimal pair. The estimated parameters may differ depending on the training data set.
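An exponentially growing grid of the kind described here might look like the following (the exact ranges are an assumption, following common practice for RBF-kernel SVM tuning):

```python
# Exponentially spaced candidate values for the RBF-kernel SVM grid search
costs  = [2.0 ** k for k in range(-5, 16, 2)]   # 2^-5, 2^-3, ..., 2^15
gammas = [2.0 ** k for k in range(-15, 4, 2)]   # 2^-15, 2^-13, ..., 2^3
grid = [(c, g) for c in costs for g in gammas]
# Each (cost, gamma) pair is scored by cross-validation; the best-scoring pair is kept.
```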
R package: Variant quality investigation helper
Our novel approach detects same-species (within-species) contamination using B-allele frequency from variant call information alone. The contamination detection procedure comprises the following steps, also outlined in Fig. 3:
Step 1: The VCF generated by a variant caller is read into R using the vanquish::read_vcf function. The supported variant callers are GATK, VarDict, and Strelka2.
Step 2: CNV regions in the VCF file are detected and filtered using the vanquish::update_vcf function.
Step 3: Features for the radial kernel SVM model are extracted from each sample using the vanquish::generate_feature function.
Step 4: The cost and gamma parameters of the kernel SVM are tuned.
Step 5: Contamination of a test sample is predicted.
The ability of our approach to determine contamination can be affected by two scenarios. First, tumor-normal samples comprising a mixture of tumor and normal cells from the same individual may be misclassified as contaminated. Second, for test samples of very low quality, a clear B-allele frequency pattern may be impossible to determine, so such samples will not be classified as contaminated.
Simulated data test results
To apply our method to real data, we used two reference samples from the 1000 Genomes Project [29], NA12878 and NA10855, sequenced at Q2 Solutions. We obtained two pairs of FASTQ files from the sequencing results and resampled and mixed them in different proportions using seqtk [30], as shown in Table 2. For this simulated test, we treated NA12878 as the sample and assumed that NA10855 was mixed into the NA12878 sample at percentages ranging from 0.5% to 20%. We then calculated the detection rate at each contamination level. Each of the six mixture samples totaled 50 million reads. Contamination percentages above 5% were readily detected while lower percentages were not (Table 2); accordingly, the analysis is sensitive to contamination above roughly 5%. For contaminants less similar to the sample into which they are mixed, detection will be easier; conversely, for contaminants more similar to the sample, contamination detection will be more challenging.
Table 2
Contamination detection for a simulated data series (M: million).
| Sample Component | Reads (NA12878) | Reads (NA10855) | Test Result |
| --- | --- | --- | --- |
| NA12878 (80%) + NA10855 (20%) | 40M | 10M | Contaminated |
| NA12878 (90%) + NA10855 (10%) | 45M | 5M | Contaminated |
| NA12878 (95%) + NA10855 (5%) | 47.5M | 2.5M | Contaminated |
| NA12878 (97.5%) + NA10855 (2.5%) | 48.75M | 1.25M | Pure |
| NA12878 (99%) + NA10855 (1%) | 49.5M | 0.5M | Pure |
| NA12878 (99.5%) + NA10855 (0.5%) | 49.75M | 0.25M | Pure |

Real-data test results
After the quantitative simulation testing, we applied the trained model to a set of real data comprising 22 samples. Table 3 displays the range of cell types and samples used, along with the results; the samples are ranked by the regression values from e1071::svm(). Predictions for 20 of the 22 samples agreed with prior identification, but two human T-lymphoblast samples (bold text in Table 3) were predicted as pure despite being contaminated. In response, we checked the B-allele frequency distribution for these two samples (Fig. 4). The middle band of the B-allele frequency pattern was shifted from 0.5 down to 0.3, indicating that the samples were mixtures of tumor and normal cells from the same individual. The distance of the shift reflects the relative proportions of tumor and normal cells in a sample.
We tested the model on a second data set comprising 53 samples: 12 samples were purposely mixed with a contaminant, and 41 were pure. The test results showed sensitivity > 99.99% and specificity of 90.24%. Our method flagged four false-positive samples, all of which were formalin-fixed, paraffin-embedded (FFPE) tissue samples that were likely degraded (Fig. 5). These false positives may arise because the features generated from a degraded sample resemble those from a contaminated sample.
Table 3
Contamination detection for a real-data series. Predictions for 20 of the 22 samples agreed with prior identification. Two human T-lymphoblast samples (bold text) were predicted as pure but were contaminated.
| Sample Name | Classification | Regression | Prior Identification |
| --- | --- | --- | --- |
| Human B-Lymphocyte L8 | 1 | 1.9243094 | 1 |
| Human B-Lymphocyte 2 L20 | 1 | 1.9209875 | 1 |
| Human Breast 2 L16 | 0 | 1.483925 | 0 |
| Human Breast L4 | 0 | 1.463376 | 0 |
| **Human T-Lymphoblast 2 L21∗** | 0 | 1.3622305 | 1 |
| **Human T-Lymphoblast L9** | 0 | 1.3472358 | 1 |
| Human Brain L3 | 0 | 1.3147938 | 0 |
| Human Brain 2 L15 | 0 | 1.303287 | 0 |
| Human Testis L12 | 0 | 1.245767 | 0 |
| Human Cervix 2 L17 | 0 | 1.2429423 | 0 |
| Human Testis 2 L24 | 0 | 1.2424441 | 0 |
| Human Cervix L5 | 0 | 1.203943 | 0 |
| Human Macrophage L10 | 0 | 1.158416 | 0 |
| Human Macrophage 2 L22 | 0 | 1.1582528 | 0 |
| Human Liver 2 L18 | 0 | 1.1442246 | 0 |
| Human Liposarcoma L7 | 0 | 1.1406007 | 0 |
| Human Liposarcoma 2 L19 | 0 | 1.132044 | 0 |
| Human Skin 2 L23 | 0 | 1.1209464 | 0 |
| Human Skin L11 | 0 | 1.1194772 | 0 |
| Human Liver L6 | 0 | 1.1170909 | 0 |
| Human Reference DNA Male L1 | 0 | 1.0945151 | 0 |
| Human Reference DNA Male 2 L13 | 0 | 1.0906951 | 0 |

∗ This sample is a mixture of tumor and normal cells. See Fig. 4 for the B-allele frequency distribution of this sample.