Inline index helped in cleaning up data contamination generated during library preparation and the subsequent steps

High-throughput sequencing involves library preparation and amplification steps, which may introduce contamination across samples or between samples and the environment. We tested an inline-index strategy, in which 6-bp DNA indices were added to both ends of the inserts at the ligation step of library preparation, as a way to resolve the data contamination problem. Our results showed that cross-contamination between samples ranged from 0.29 to 1.25% in one experiment and from 0.83 to 27.01% in the other. We also found that, besides cross-contamination between samples, contamination could come from the environment or from reagents. The inline-index method is a useful experimental design for cleaning up data and addressing the contamination problem that has been plaguing high-throughput sequencing data in many applications.


Introduction
High-throughput sequencing has become a dominant and efficient data collection method for research such as phylogenomics, environmental DNA (eDNA), and ancient DNA analyses. However, contamination from the environment or cross-contamination from other samples has become a persistent problem in many studies involving high-throughput sequencing [1][2][3][4][5]. Contaminated data may strongly bias the results of phylogenetic analysis [2,6], cause branch lengths to be over-estimated and affect orthology detection [7]. Contamination can also confound metagenomic studies [1]. For example, a distinct microbial community of the placenta was shown to be an artifact of contamination [8]. In ancient DNA analysis, contamination with modern human DNA can bias the estimates [9], whereas unexpected amplification of negative control samples has often been reported in eDNA analyses, owing to the complex nature of eDNA samples [10][11][12].
Contamination can originate from environmental sources, such as extraction kits, plastic consumables and reagents, or be accidentally transferred from other samples of the same batch [1,13]. Synthesizing barcodes or outsourcing sequencing can also result in cross-contamination that is beyond the control of the researcher's lab [14]. Preparing a library for high-throughput sequencing involves many washing and amplification steps, which may increase the risk of contamination. Further manipulation of the samples, such as target-gene enrichment, often raises the chance of contamination even more.
Strict decontamination protocols and guidelines for preventing contamination in eDNA experiments have been proposed [15]. A standard checklist has also been suggested for metagenomic labs to avoid contamination [1]. Furthermore, various methods have been designed for detecting contamination [14,16], and different pipelines and tools have been created to clean up contaminated data [10,13,14,16,17,18,19]. Nevertheless, these methods either rely on precautions to reduce the chance of contamination or identify contaminants a posteriori by comparing the reads to references or to other samples.
Rohland et al. designed incomplete adapters composed of P5/P7 barcodes (inline indices) and partial Illumina adapters to increase the efficacy of target enrichment [20]. They also suggested that the P5/P7 barcodes could be used to trace cross-contamination; however, they did not test this idea. Subsequent research using their method also focused on the uracil-DNA-glycosylase treatment rather than on resolving the contamination issue [21,22]. Here, we applied the "inline-index" strategy to two representative datasets, one from a phylogenomic study and the other from an environmental DNA study. Our objective was to test with empirical data how the inline-index strategy can be used to solve the problem of cross-contamination.

Designing and synthesizing the inline indices
Indices of 6 bp were designed using scripts downloaded from https://bioinf.eva.mpg.de/multiplex/ (accessed in May 2018). The following filters were applied: (1) the edit distance was three, i.e., at least three substitutions are needed to change one index into another, to avoid index swaps due to amplification or sequencing errors; (2) no polymer index was used, so that index signal could be distinguished from dust or chemical particles; (3) the sequences 'AAA', 'ACA', 'CCC', 'CAC', 'GGG', 'GTG', 'TTT', and 'TGT' were excluded, to avoid nucleotides that are illuminated by the same laser consecutively. The inline indices were added to the 3' end of the IS1 or IS2 sequences of Meyer and Kircher [23]. Reverse-complement oligos of the inline indices were added to the 5' end of their IS3 sequences (Fig. 1A). Twenty-four pairs of oligos (IS1_IndL and IS3_IndL') with inline indices were designed and synthesized for making the P5 adapter, and 24 pairs (IS2_IndR and IS3_IndR') for the P7 adapter (Table 1). The inline indices were synthesized at Sangon Biotech (Shanghai, China). The protocol of Meyer and Kircher [23] was followed to mix the oligos to make the P5 and P7 adapters. Note that each IS1_IndL and IS2_IndR must be paired with its corresponding reverse-complement IS3 oligo. Extreme precaution should be taken to avoid contamination during index preparation. The prepared indices can be divided into small aliquots and stored at −20 °C.
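The three design filters described above can be sketched in Python as follows. This is only an illustration and not the script from the website cited above: it assumes that "edit distance" means the number of substitutions (Hamming distance), that a "polymer index" is an index made of a single repeated base, and that the listed triplets are forbidden anywhere inside an index. The function names and the greedy selection order are hypothetical.

```python
from itertools import product

# Triplets excluded in the text (same-laser nucleotides read consecutively)
EXCLUDED_TRIPLETS = ("AAA", "ACA", "CCC", "CAC", "GGG", "GTG", "TTT", "TGT")


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def passes_sequence_filters(index):
    """Reject single-base indices and indices containing an excluded triplet."""
    if len(set(index)) == 1:  # assumption: "polymer" = one repeated base
        return False
    return not any(triplet in index for triplet in EXCLUDED_TRIPLETS)


def pick_indices(length=6, min_distance=3, limit=None):
    """Greedily collect indices that differ pairwise by >= min_distance substitutions."""
    chosen = []
    for bases in product("ACGT", repeat=length):
        candidate = "".join(bases)
        if not passes_sequence_filters(candidate):
            continue
        if all(hamming(candidate, kept) >= min_distance for kept in chosen):
            chosen.append(candidate)
            if limit is not None and len(chosen) == limit:
                break
    return chosen
```

A greedy scan like this returns one valid set, not necessarily the largest one or the set listed in Table 1; the same `hamming` check can be used to verify any externally designed index list.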
Table 1 Oligos with inline indices used to make the P5 (IS1_IndL and IS3_IndL') and P7 (IS2_IndR and IS3_IndR') adapters

| Index | IS1 IndL | IS1 IndL sequence | IS3 IndL' | IS3 IndL' sequence |
|---|---|---|---|---|
| TCT GCC | IS1_Ind1 | A*C*A*C*TCT TTC CCT ACA CGA CGC TCT TCC GATCTtc*t*g*c*c | IS3_Ind1 | ggcagaAGA TCG GAA*G*A*G*C |
| GTC TCT | IS1_Ind2 | A*C*A*C*TCT TTC CCT ACA CGA CGC TCT TCC GATCTgt*c*t*c*t | IS3_Ind2 | agagacAGA TCG GAA*G*A*G*C |
| ATA TTG | IS1_Ind3 | A*C*A*C*TCT TTC CCT ACA CGA CGC TCT TCC GATCTat*a*t*t*g | IS3_Ind3 | caatatAGA TCG GAA*G*A*G*C |
| TGG AAG | IS1_Ind4 | A*C*A*C*TCT TTC CCT ACA CGA CGC TCT TCC GATCTtg*g*a*a*g | IS3_Ind4 | cttccaAGA TCG GAA*G*A*G*C |
| TCT AGT | IS1_Ind5 | A*C*A*C*TCT TTC CCT ACA CGA CGC TCT TCC GATCTtc*t*a*g*t | IS3_Ind5 | actagaAGA TCG GAA*G*A*G*C |
| AGA GTA | IS1_Ind6 | A*C*A*C*TCT TTC CCT ACA CGA CGC TCT TCC GATCTag*a*g*t*a | IS3_Ind6 | tactctAGA TCG GAA*G*A*G*C |
| GGC CAA | IS1_Ind7 | A*C*A*C*TCT TTC CCT ACA CGA CGC TCT TCC GATCTgg*c*c*a*a | IS3_Ind7 | ttggccAGA TCG GAA*G*A*G*C |
| TAT CTC | IS1_Ind8 | A*C*A*C*TCT TTC CCT ACA CGA CGC TCT TCC GATCTta*t*c*t*c | IS3_Ind8 | gagataAGA TCG GAA*G*A*G*C |
| TTA TGC | IS1_Ind9 | A*C*A*C*TCT TTC CCT ACA CGA CGC TCT TCC GATCTtt*a*t*g*c | IS3_Ind9 | gcataaAGA TCG GAA*G*A*G*C |
| AGT TGG | IS1_Ind10 | A*C*A*C*TCT TTC CCT ACA CGA CGC TCT TCC GATCTag*t*t*g*g | IS3_Ind10 | ccaactAGA TCG GAA*G*A*G*C |
| GTC AAG | IS1_Ind11 | A*C*A*C*TCT TTC CCT ACA CGA CGC TCT TCC GATCTgt*c*a*a*g | IS3_Ind11 | cttgacAGA TCG GAA*G*A*G*C |
| CAG CAA | IS1_Ind12 | A*C*A*C*TCT TTC CCT ACA CGA CGC TCT TCC GATCTca*g*c*a*a | IS3_Ind12 | ttgctgAGA TCG GAA*G*A*G*C |
| TCG CCG | IS1_Ind13 | A*C*A*C*TCT TTC CCT ACA CGA CGC TCT TCC GATCTtc*g*c*c*g | IS3_Ind13 | cggcgaAGA TCG GAA*G*A*G*C |
| CTA AGA | IS1_Ind14 | A*C*A*C*TCT TTC CCT ACA CGA CGC TCT TCC GATCTct*a*a*g*a | IS3_Ind14 | tcttagAGA TCG GAA*G*A*G*C |
| CCG CTT | IS1_Ind15 | A*C*A*C*TCT TTC CCT ACA CGA CGC TCT TCC GATCTcc*g*c*t*t | IS3_Ind15 | aagcggAGA TCG GAA*G*A*G*C |
| AAG TTA | IS1_Ind16 | A*C*A*C*TCT TTC CCT ACA CGA CGC TCT TCC GATCTaa*g*t*t*a | IS3_Ind16 | taacttAGA TCG GAA*G*A*G*C |
| GGT ACC | IS1_Ind17 | A*C*A*C*TCT TTC CCT ACA CGA CGC TCT TCC GATCTgg*t*a*c*c | IS3_Ind17 | ggtaccAGA TCG GAA*G*A*G*C |
| CCA GGT | IS1_Ind18 | A*C*A*C*TCT TTC CCT ACA CGA CGC TCT TCC GATCTcc*a*g*g*t | IS3_Ind18 | acctggAGA TCG GAA*G*A*G*C |
| AAT CGA | IS1_Ind19 | A*C*A*C*TCT TTC CCT ACA CGA CGC TCT TCC GATCTaa*t*c*g*a | IS3_Ind19 | tcgattAGA TCG GAA*G*A*G*C |
| AAC GCA | IS1_Ind20 | A*C*A*C*TCT TTC CCT ACA CGA CGC TCT TCC GATCTaa*c*g*c*a | IS3_Ind20 | tgcgttAGA TCG GAA*G*A*G*C |
| GAC GAC | IS1_Ind21 | A*C*A*C*TCT TTC CCT ACA CGA CGC TCT TCC GATCTga*c*g*a*c | IS3_Ind21 | gtcgtcAGA TCG GAA*G*A*G*C |
| CGC GCT | IS1_Ind22 | A*C*A*C*TCT TTC CCT ACA CGA CGC TCT TCC GATCTcg*c*g*c*t | IS3_Ind22 | agcgcgAGA TCG GAA*G*A*G*C |
| CCG TAG | IS1_Ind23 | A*C*A*C*TCT TTC CCT ACA CGA CGC TCT TCC GATCTcc*g*t*a*g | IS3_Ind23 | ctacggAGA TCG GAA*G*A*G*C |
| GTA ATC | IS1_Ind24 | A*C*A*C*TCT TTC CCT ACA CGA CGC TCT TCC GATCTgt*a*a*t*c | IS3_Ind24 | gattacAGA TCG GAA*G*A*G*C |

| Index | IS2 IndR | IS2 IndR sequence | IS3 IndR' | IS3 IndR' sequence |
|---|---|---|---|---|
| GAC CTT | IS2_Ind25 | G*T*G*A*CTG GAG TTC AGA CGT GTG CTC TTC CGA TCT ga*c*c*t*t | | |

[...] [23] was followed to prepare the sequencing libraries. For each sample, a unique pair of P5 and P7 adapters with inline indices was added, while the P7 adapters also carried their own classic P7 index (Fig. 1B). Two different persons prepared the two batches of samples to represent handling variation in the lab. The libraries were sequenced on an Illumina HiSeq X Ten platform by Genewiz (Genewiz, Inc., Shanghai, China). We chose these two batches of samples to represent common high-throughput sequencing applications in which contamination has been problematic.

Data analyses
The raw reads were trimmed with trim_galore v0.6.4 (https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/) to remove adapter sequences and low-quality reads. Duplicate reads arising from PCR amplification were excluded using the "-fastx_uniques" command in USEARCH v10.0.240 [25]. A custom Perl script (demultiplex_inline.pl, supplementary files) was used to extract the first 6 bp of each sequence and compare them with the inline index pairs assigned to the samples. Reads with both ends matching the assigned inline index pair were considered "true reads" and the remaining reads were considered "others". The "others" were further counted and separated into four parts: (1) "cross-contamination" (contamination_cnt.pl, supplementary files): DNA fragments carried inline indices that were not the correct ones but were used for other samples in the same batch; (2) "inline index mutation" (kmer_summary_1sub.py, supplementary files): the inline index sequences differed from the correct ones by one mutation; (3) "inline index contamination" (kmer_summary.py, supplementary files): the sequences matched inline indices used in our lab but not for the same batch of samples; (4) "unknown": all remaining reads, which could not be traced back to a source.
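The read categories described above can be illustrated with a minimal Python sketch. It is not the supplementary Perl and Python scripts, whose exact rules are not reproduced here. It assumes that the P5 index is the first 6 bp of read 1 and the P7 index is the first 6 bp of read 2, that "one mutation" means at most one substitution per end, and that any pair of known lab indices that is not the pair of another sample in the batch counts as "inline index contamination". The function name, arguments and category labels are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def classify_read_pair(read1, read2, expected, batch_pairs, lab_indices, length=6):
    """Assign one read pair of a library to a category.

    expected    -- (P5 index, P7 index) assigned to this library
    batch_pairs -- set of index pairs assigned to the other libraries of the batch
    lab_indices -- set of all inline index sequences in use in the lab
    """
    if len(read1) < length or len(read2) < length:
        return "unknown"  # too short to carry a full index
    observed = (read1[:length].upper(), read2[:length].upper())
    if observed == expected:
        return "true"
    # At most one substitution at each end relative to the expected pair
    if max(hamming(o, e) for o, e in zip(observed, expected)) <= 1:
        return "index_mutation"
    if observed in batch_pairs:
        return "cross_contamination"
    if all(index in lab_indices for index in observed):
        return "index_contamination"
    return "unknown"
```

Because the indices differ from one another by at least three substitutions, checking for a one-substitution match before checking the other samples' pairs cannot misfile a read that carries another sample's exact indices.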

Composition of the reads
The "true reads", which have the correct inline index at both ends, accounted for 38.16 to 80.04% of the total reads of each sample in the eDNA experiment (Fig. 2A). Samples 2 and 3 had a substantial amount of contamination, probably due to operational errors in handling these two samples. True reads made up 87.81 to 92.36% of the total reads of each sample in the chondrichthyan study (Fig. 3A).

Correction of reads with index mutation
Because reads with inline index mutation accounted for most of the "other" reads, we corrected index sequences with one mutation to their closest indices used in our lab. After correction, the true reads increased to 48.67 to 90.51% for the eDNA experiment (Fig. 2B) and 91.27 to 95.63% for the chondrichthyan experiment (Fig. 3B).
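The correction step can be sketched as a nearest-index lookup. When every pair of indices differs by at least three substitutions, an observed sequence that is one substitution away from one index is at least two substitutions away from every other index, so the match is unique. The sketch below is an illustration under that assumption, not the script used in this study; `correct_index` and its parameters are hypothetical names.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def correct_index(observed, known_indices, max_mismatch=1):
    """Return the single known index within max_mismatch substitutions, else None."""
    hits = [
        known
        for known in known_indices
        if len(known) == len(observed) and hamming(observed, known) <= max_mismatch
    ]
    # Refuse to correct when the match is missing or ambiguous
    return hits[0] if len(hits) == 1 else None
```

For example, with the indices TCTGCC, GTCTCT and ATATTG from Table 1, the observed sequence TCTGCA is assigned to TCTGCC, whereas CCCCCC is left uncorrected.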

Cross-contamination
Contamination from other samples accounted for 0.83-27.01% in the eDNA experiment and 0.29-1.25% in the chondrichthyan experiment (Fig. 4). The results showed that cross-contamination was ubiquitous regardless of who ran the experiment and of whether the tubes were close to each other on the rack, so the contaminants did not necessarily jump from adjacent tubes (Fig. 5).

Fig. 2 Percentage of reads assigned correctly in the eDNA experiment. A before correction of reads with inline index mutation; B after correction of reads with inline index mutation

Inline index contamination
Some reads had at least one inline index that was not used in the same batch of samples but was among the other inline indices used in our lab. These could be due to cross-contamination between indices during synthesis or to contamination through common reagents. The eDNA samples had 1.21 to 12.1% contamination due to inline index mixing, whereas the chondrichthyan samples had 1.16 to 2.71% (Fig. 4).

Unknown contamination
Besides cross-contamination among samples and inline index contamination, a large portion of reads came from unknown sources, accounting for 6.9 to 20.3% in the eDNA samples and 4.07 to 7.47% in the chondrichthyan samples (Figs. 3 and 4). The unknown contamination was randomly distributed among samples.

Discussion
The results for "true reads" and "others" showed that the baseline contamination in our lab is around 10% to 20%, but accidental mishandling can increase it to almost 40%. Allio et al. [7] found generally very low cross-contamination (0.26%) in their genome shotgun sequencing data of swallowtail butterflies, but one sample, Parnassius imperator, had a significantly higher value, with 26.71% of the contigs contaminated, suggesting that human error can have a sporadic but severe impact on high-throughput sequencing data. We found that the "others" had a similar composition in the two experiments. If we disregard samples 2 and 3 of the eDNA experiment, the components of the "others", ranked from high to low, were "inline index mutation", "unknown", "inline index contamination", and "cross-contamination".
Because we applied a minimum substitution distance of three base pairs between indices, it should be safe to assign reads with a one-base-pair mutation in their inline indices back to the "correct" sample. Indeed, we found that few inline indices had two substitutions. The index mutations could have two causes. One is sequencing error, which should be minimal because of the low error rate in sequencing the first six base pairs. The other is error during synthesis of the inline indices, so the accuracy of indices synthesized by different providers should be tested in the future.
The contaminated reads showed that the sources of contamination were samples 10, 7 and 1 in the eDNA experiment and samples 8, 7 and 1 in the chondrichthyan study. The degree of cross-contamination was also not uniform. For example, samples 2 and 3 of the eDNA experiment had 24.93% and 27.01% contamination, mainly from sample 10, suggesting that handling errors may have been involved. The route of cross-contamination may be through tube lids, gloves, pipette tips or aerosols.
eDNA sample 7 had 12.1% inline index contamination, mostly from one pair of inline indices, suggesting that it may have been contaminated by another library prepared in our lab. We suggest that indices be synthesized one at a time to reduce inline index contamination. Extreme care should be taken when making the adapters in the lab, and the adapters should be divided into small aliquots for storage.
For the reads of unknown source, at least one unknown "inline index" was sequenced. Our lab has also been using the classic library preparation protocol [23], in which no inline index is applied. We suspect that libraries made with the classic protocol may have contaminated common reagents, which in turn contaminated the samples of this project. This would explain why the first 6 bp of those reads cannot match any inline index.
Among the "others", reads with one mutation in the index can be reassigned to their original sample, but reads with mixed inline indices or unknown index sequences cannot be assigned to any sample and should be excluded from further analyses.
Most analytical approaches are based on data filtering. For example, CroCo [26] is a database-independent method that can be used to trace cross-contamination from divergent species, but it cannot recognize differences between closely related organisms. ConFindr [17] can be used to identify a contaminated sample if it contains more than one allele of core single-copy ribosomal protein genes. Dickins et al. [14] proposed a two-part pipeline to identify contaminated samples based on an unexpected number of variants and a phylogenetic approach.
The inline-index method, however, is independent of the sample sequences, so no assumption about the similarity between samples is required to deduce the composition of the reads. The reads can be recognized by their inline indices even if the sample contains DNA from unknown species, as in eDNA data, a situation in which the analytical methods may not work. Kircher et al. [27] designed a double-index technique to detect jumping PCR by adding indices to both the P5 and P7 Illumina adapters. Peyrégne et al. [28] implied that the double-index method may be used to monitor contamination. However, the double indices are added before sequencing, and there are many opportunities for contamination to occur before that step. Rohland et al. [20] invented the inline-index method, which adds a pair of unique barcodes to both ends of the DNA insert to trace contamination. Because they aimed at ancient DNA research, they paid most of their attention to the effects of the unique barcodes on damage rate and did not test whether the inline-index approach can be used to mitigate problems of cross-contamination. Our results showed that contamination occurred ubiquitously and that the unique inline barcodes can be used to trace the source of read contamination.
Furthermore, because samples are labeled with inline indices before they are sent to the sequencing facility, cross-contamination arising at the sequencing center can be controlled. Cross-contaminated reads from other samples can be assigned back to their origin based on their inline-index pair to rescue the data. Finally, with 24 pairs of inline indices plus the P7 index, more samples can be multiplexed and sequenced in the same sequencing lane to save cost.
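Tracing cross-contaminated reads back to their origin amounts to a lookup from index pair to sample. The following Python sketch shows one way this could be tallied for a single library; the function name, the sample labels and the index pairings in the usage note are hypothetical and serve only as an illustration.

```python
from collections import Counter


def tally_origins(observed_pairs, pair_to_sample, library):
    """Count, for one library, the read pairs carrying another sample's index pair.

    observed_pairs -- iterable of (P5 index, P7 index) tuples read from the library
    pair_to_sample -- dict mapping each assigned index pair to its sample name
    library        -- name of the sample this library is supposed to contain
    """
    counts = Counter()
    for pair in observed_pairs:
        origin = pair_to_sample.get(pair)
        # Skip the library's own reads and pairs not assigned to any sample
        if origin is not None and origin != library:
            counts[origin] += 1
    return counts
```

Calling `tally_origins` on the index pairs of a library labeled "s1", with a mapping that assigns ("GTCTCT", "TGGAAG") to "s2", returns the number of read pairs in that library that originated from "s2"; these reads can then be moved to the "s2" data set rather than discarded.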

Conclusion
The inline-index method is experimental, so it could be a good complement to analytical approaches such as CroCo and ConFindr. The combined use of these methods should become routine in the near future. Environmental DNA samples naturally contain DNA from mixed sources at various concentrations, so it is impossible to know whether the reads are genuine or from other samples by using a computational approach. In this scenario, the inline-index method or another experimental method is probably the only way to address the contamination problem. In the future, we can test whether the "blunt end" and "ligation" steps of library preparation can be combined without washing steps in between, so that the inline index is added at the very first step of library preparation to further reduce the chance of cross-contamination.