Determination of a Criminal Suspect Using Environmental Plant DNA Metabarcoding Technology

Background: In some criminal cases, commonly used evidence such as the perpetrator's DNA is not available, and such cases are usually difficult to resolve. With the advent of DNA metabarcoding, evidence can be mined from environmental DNA, making such cases resolvable. This study reports how a criminal suspect was identified using environmental plant DNA metabarcoding technology. A girl was killed in a rural wet area in China, with no witness or video record. A pair of pants with dried mud was found in the house of one of her boyfriends. The mud was removed from the pants, and 11 more mud or soil samples were collected from the murder scene and its surroundings. DNA was extracted from the soil. Chloroplast rbcL gene fragments were amplified and sequenced on a next generation sequencing platform. Results: After bioinformatics analysis, 1495 of the 2980 ZOTUs obtained from the 12 samples were identified to species, genus or family based on existing public databases. FEAST analysis based on either taxa alone or taxa plus abundance data demonstrated that the mud on the suspect's pants came from the crime scene. Conclusions: The suspect finally confessed to his crime. This case implies that plant DNA in environmental soil is a new source of evidence for identifying suspects using DNA metabarcoding technology and has high potential for extensive application in criminal cases.


Background
Human deoxyribonucleic acid (DNA) has been widely used in human individual identification (Ambers et al., 2018; Lygo et al., 1994; Meng et al., 2019), paternity identification (Bertoglio et al., 2020; Habibi et al., 2019) and other applications in forensics. However, human DNA is not always available. In such situations, investigators have to resort to environmental DNA from the crime scene to narrow the search for criminal suspects and find out the truth.
Environmental materials such as soil, dust and water are very likely to be carried away unintentionally by suspects on their skin, shoes, clothes or hair, or even under their nails. Among them, soil, which usually contains plant fragments or pollen grains, is the material the police can obtain in most criminal cases. Plant DNA is well suited to forensic source tracking because of its ubiquity, stability and appropriate variability.
Plant DNA has a high potential to provide definitive evidence during criminal investigations. With the advent of DNA metabarcoding, it has recently been used to find a body dumping site (Yang et al., 2015), the residence of an unidentified human body, a drowning site (Fang et al., 2019), and to confirm suspected drowning (Kakizaki et al., 2018). Unfortunately, such applications are still very rare due to three main challenges. The first is the difficulty of identifying plant DNA in environmental materials to species. Past projects (e.g., BARCODE 500K (https://ibol.org), BIOSCAN (Hobern & Hebert, 2019), ISHAM-ITS (Irinyi et al., 2016)) have enriched the pool of DNA barcodes, but the reference library for DNA barcoding is still far from comprehensive. Fewer than 5.0% of flowering plant species have their matK or rbcL sequences deposited in GenBank (Liu et al. 2021).
The second challenge is that the Sanger sequencing method is not applicable to environmental DNA because the amplicons are a mixture of many species. Fortunately, next generation sequencing (NGS) platforms meet the requirement of environmental DNA metabarcoding and a very easy data processing method is now available (https://github.com/YanleiLiu1989/Cotu-master).
The last challenge is the lack of an "ideal" DNA barcode for DNA metabarcoding (Ferri et al., 2015). A DNA barcode is a short DNA sequence used for species recognition and discrimination. DNA barcoding is a commonly used biotechnology in biology, environmental science, forensics, etc. (Ferri et al., 2015; Hebert et al., 2003). It is a powerful molecular diagnostic method for specimen identification. Finding the best DNA barcodes (Dong et al., 2014; Dong et al., 2015; Kress & Erickson, 2007; Li et al., 2011) or developing new technical improvements (Yu et al., 2011; Xu et al., 2015) was one of the main themes of plant DNA barcoding during the past decade. Unfortunately, there is no single ideal DNA barcode suitable for identifying all plant species, and plant group-specific DNA barcodes seem more realistic. For example, rbcL is much less variable than ycf1 in flowering plants, but is acceptable as a DNA barcode for lower plants (Liu et al., 2020a).
The lower plants (algae) instead of higher plants (mosses, ferns and seed plants) play a very important role in investigation of wet environment-related criminal cases and rbcL has been proposed as a DNA barcode of diatoms (Liu et al., 2020a). The variability of rbcL is much higher in lower plants than in higher plants and rbcL is one of the few choices of DNA barcodes for lower plants for its relatively higher species coverage of existing sequences and universal PCR primers (Ferri et al., 2015).
In this paper, we demonstrate how mud collected from a criminal suspect's pants was used to identify the real criminal in a murder case that happened in China, based on DNA metabarcoding of diatoms using chloroplast rbcL gene fragments. The diatom communities in the mud provided solid evidence of the suspect's presence at the murder scene.

Data from the Ion Torrent S5xl platform
After sequencing on the Ion Torrent S5xl platform, a total of 2,917,507 raw reads were obtained. After quality control and read length selection, 2,754,982 clean reads (94.43%) were retained. The mean sequencing depth of the soil samples was 229,581 reads per sample.
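As a quick consistency check on these figures, the retention rate and the mean depth can be recomputed from the reported read counts. The sketch below is plain arithmetic; the variable names are ours, and the mean assumes that the clean reads are averaged over all 12 samples.

```python
# Read counts as reported in the text.
raw_reads = 2_917_507
clean_reads = 2_754_982
n_samples = 12  # assumption: mean depth is taken over all 12 soil samples

# Percentage of raw reads retained after quality control and length selection.
retained_pct = 100 * clean_reads / raw_reads

# Mean depth per sample, truncated to a whole number of reads.
mean_depth_floor = clean_reads // n_samples

print(f"retained: {retained_pct:.2f}%")
print(f"mean depth: {mean_depth_floor} reads per sample")
```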

Total and annotated number of ZOTUs
A total of 2,980 ZOTU sequences were created using Usearch. The mean number of ZOTU sequences per sample was 961. Among the 12 samples, Site 5-1 was the richest, with 1,176 ZOTU sequences, while Site 3-1 was the poorest, with 725 ZOTU sequences (Table 1).

The mud on the suspect's pants was from the crime scene
All results led to the conclusion that the mud on the suspect's pants was from the crime scene (Fig. 3).

Discussion
Environmental plant DNA, a new source of evidence for difficult crime cases
Before the emergence of DNA metabarcoding technology, it was unrealistic to extract evidence from environmental DNA for forensic purposes because of the high costs of DNA cloning and sequencing. With DNA metabarcoding technology, cold cases without a witness, video record or human DNA have become resolvable. DNA metabarcoding powered by next generation sequencing (NGS) technology has now become a powerful approach to forensic evidence collection from environmental samples (Young et al., 2017). This study provides one more example, demonstrating successful tracking of the source of mud on the suspect's pants via the diatom community. Suspect-related environmental samples, including diatoms or pollen, are new sources of material evidence. Special attention should be paid to technical aspects such as contamination, DNA barcodes, and data processing methods when using DNA metabarcoding data as evidence.

Organism contamination and false positives
Plant particles such as leaf fragments, pollen grains and spores can be carried by wind, so contamination of experimental samples is very likely to arise while collecting and processing them.
Although higher plants cannot move, spores and pollen grains can travel long distances (Rousseau et al., 2006). Such particles accumulate on the surfaces of experimental benches, apparatuses and even clothes (Thomsen & Willerslev, 2015). Although the amount of contaminants is very small, they are detectable when amplified by PCR and sequenced on NGS platforms. A laboratory free of biological contaminants is ideal for this purpose. To lower the risk of contamination, only organisms specific to particular environments should be considered, for example, diatoms in wet environments, forage herbs in prairies, and crops in crop fields. Diatoms are highly diverse in species and ubiquitous in wet environments. In this study, we barcoded diatoms in the soil because the incident happened in a canal and diatoms are very good indicators.

Suitable DNA barcode for DNA metabarcoding
Whether a soil sample can be traced back to its source locality depends on the use of a suitable DNA barcode. Although there is a trade-off between the universality of primers and the resolution of barcodes for higher plants, the universality of primers is more important for DNA metabarcoding of lower plants in soil samples because almost all DNA barcodes are variable enough to resolve most known taxonomic units (Liu et al., 2020b). Lower plants have shorter lifetimes, evolve much more quickly and accumulate more genetic variation in their genomes than higher plants. However, lower plants are the organisms least known to taxonomists, and a large proportion of them can only be identified to genus or even family level. For example, rbcL is one of the least variable genes in seed plants, but its variability in lower plants is much higher (Dong et al., 2014) and it can serve as a DNA barcode for lower plants such as diatoms (Liu et al., 2020a).
Another advantage of using rbcL as a DNA barcode for lower plants is that this gene is located in the plastid genome, so contamination by organisms lacking a plastid genome does not interfere with the data analyses. There is usually a vast range of other organisms in soil samples, for example, insects, fungi and bacteria. When using DNA barcodes from the nuclear genome such as 18S, amplification of these organisms is usually inevitable, which consumes considerable experimental and analytical resources.

A sequence reference library is not a prerequisite, but it is better than nothing
Soil sample source tracking using DNA metabarcoding (or any other method) is based on a set of data (here considered a local library), and operational taxonomic units (OTUs) are used instead of species names.
This means that a universal reference sequence library is not necessary for forensics. As exemplified in this study, results based on the total OTUs led to the same conclusion as those based on the annotated OTUs.
However, if the OTUs are annotated to species, genera or families, extra information such as morphological characters can be used and the evidence becomes more solid.
To annotate the OTUs, a well-curated sequence reference library is indispensable. The reference library helps to exclude experimental artifacts (such as chimeras) and non-target species. Although some efforts have been made, the DNA barcode reference library is still far from satisfactory due to low species coverage (Lou et al., 2010; Ratnasingham & Hebert, 2007; Tnah et al., 2019), especially for lower plants. For example, there are about 8397 known species worldwide and only 889 species have their rbcL sequences deposited in GenBank (4116 accessions; accessed on Jan. 9, 2021; rbcL in plastid genomes was not considered).

Data analysis methods
DNA metabarcoding is NGS platform-based. Correct extraction of sequences is crucial for successful source tracking. There are several NGS data processing pipelines (such as OTU, DADA2, COTU, etc.), each with its own advantages and disadvantages. The OTU pipeline, the earliest one, groups reads at a certain similarity (usually 0.97) and creates OTUs with very short computing time. DADA2 and UNOISE3 do not adopt a subjective similarity value. The COTU method, a recently proposed strategy, updates the OTU method by elongating the consensus sequence to be created (Liu et al. 2021) at the cost of computing time. Although there are some comparative studies of the pipelines (Prodan et al., 2020; Xiong & Zhan, 2018), it is still too early to say which one is the best.
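To make the similarity-threshold idea concrete, the following is a minimal sketch of greedy OTU clustering at 0.97 identity. It is a toy illustration, not the algorithm implemented in Usearch or in any pipeline named above: the function names are hypothetical, identity is computed position by position on equal-length sequences without alignment, and the input is assumed to be sorted by decreasing abundance.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions; sequences of unequal length score 0."""
    if not a or len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otus(seqs_by_abundance, threshold=0.97):
    """Assign each sequence to the first centroid it matches, else open a new OTU.

    Returns a dict mapping each centroid sequence to the list of its members.
    """
    members = {}
    for seq in seqs_by_abundance:
        for centroid in members:
            if identity(seq, centroid) >= threshold:
                members[centroid].append(seq)
                break
        else:
            # No existing centroid is similar enough: this sequence seeds a new OTU.
            members[seq] = [seq]
    return members


# Toy input: three 100-nt sequences, most abundant first.
base = "A" * 100
close_variant = "C" * 2 + "A" * 98   # 98% identical to base
far_variant = "C" * 5 + "A" * 95     # 95% identical to base
otus = greedy_otus([base, close_variant, far_variant])
print(len(otus), "OTUs")
```

With this toy input the 98% variant joins the first OTU and the 95% variant seeds a second one, which shows how the choice of threshold directly shapes the units used downstream.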
The other important issue in data analysis is how soil samples can be reliably traced back to their place of origin based on OTUs and their abundances. SourceTracker (Knights et al., 2011) and FEAST (Shenhav et al., 2019) are the two most popular software packages for allocating components of a microorganism community to potential sources, and the latter is claimed to be quicker and more accurate.
In this study, we tested the source tracking accuracy of four kinds of data sets (the four combinations of inclusion/exclusion of singletons and total/annotated OTUs) using FEAST. The results were nearly the same, indicating the high power of FEAST and the reliability of the conclusion.
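The four data sets can be derived from a single abundance table, as in the hedged sketch below. The table layout, the function name `build_datasets` and the toy counts are hypothetical, and a singleton is assumed here to be a ZOTU supported by a single read across all samples; the text does not state the exact definition used.

```python
from itertools import product


def build_datasets(counts, annotated):
    """Build the four variants of an abundance table.

    counts:    {zotu_id: {sample_name: read_count}}
    annotated: set of zotu_ids that received a taxonomic annotation
    """
    datasets = {}
    for keep_singletons, annotated_only in product((True, False), repeat=2):
        table = {}
        for zotu, per_sample in counts.items():
            if annotated_only and zotu not in annotated:
                continue
            # Assumption: a singleton has one read in total over all samples.
            if not keep_singletons and sum(per_sample.values()) <= 1:
                continue
            table[zotu] = per_sample
        key = (
            "with_singletons" if keep_singletons else "no_singletons",
            "annotated" if annotated_only else "total",
        )
        datasets[key] = table
    return datasets


# Toy abundance table with one singleton (Z2) and one unannotated ZOTU (Z3).
toy_counts = {
    "Z1": {"pants": 10, "site_a": 5},
    "Z2": {"pants": 1, "site_a": 0},
    "Z3": {"pants": 0, "site_a": 7},
}
toy_datasets = build_datasets(toy_counts, annotated={"Z1", "Z2"})
for key, table in toy_datasets.items():
    print(key, sorted(table))
```

Each of the four tables can then be passed to the source tracking step separately, so that agreement between them can be checked.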

Conclusions
Using these results, the police successfully exposed the suspect's lie that he had never been to the crime scene. With this evidence as a breakthrough, the suspect finally confessed to his crime. This case implies that plant DNA in environmental soil is a new source of evidence for identifying suspects using DNA metabarcoding technology and has high potential for extensive application in criminal cases.

Methods
The whole study procedure includes five parts: soil DNA acquisition, amplicon preparation, amplicon sequencing, NGS data processing and suspect's mud tracking (Fig. S1).

Soil DNA acquisition
A total of 12 soil samples were collected, including one mud sample from the criminal suspect's pants, three soil samples from the center of the crime scene and eight soil samples from the area surrounding the crime scene (Table 1, Fig. 1).
Approximately 25 mg of thoroughly mixed soil from each sample was ground into powder in a grinder mill (MM400, Retsch GmbH, Germany) equipped with a zirconium magnetic bead at 29 Hz for two minutes at 30-second intervals to minimize DNA damage. Total soil DNA was extracted using the mCTAB method (Li et al., 2013).
DNA was resuspended in 100 µL of TE buffer, visually checked on 1.5% agarose gels, and quantified on a Nanodrop2000c spectrophotometer (Thermo Fisher Scientific Inc., USA).

Amplicon preparation
Since this murder case happened in a small canal, we chose the rbcL gene of diatoms as our DNA barcode. The primer pair BacirbcL2f and BacirbcL2r (Liu et al., 2020a) was used for this case. DNA fragments from the same sample were labeled with a unique DNA oligo by PCR (Table 2, Fig. 2). A unique eight-nucleotide oligo for each sample was attached to the 5' end of both the forward and reverse primers. PCR with a 10 µL mixture was conducted on an Eppendorf Mastercycler proS instrument following Dong et al. (2015). The PCR products were checked by electrophoresis on a 1.5% agarose gel containing ethidium bromide under an ultraviolet transilluminator. The DNA-labelled PCR products were mixed, purified using a purification kit (Aidlab Biotechnologies Co., Ltd, China) on a 2% agarose gel, and quantified on a Nanodrop2000c spectrophotometer.
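The sample-labelling scheme described above is what later allows pooled reads to be assigned back to their samples. The sketch below illustrates that idea only; it is not the FASTX-Toolkit procedure used in this study. The tag sequences and function names are hypothetical, and it assumes that a full-length read starts with the sample tag and ends with its reverse complement, with no sequencing errors in either tag.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def demultiplex(reads, sample_tags, tag_len=8):
    """Bin reads by their 5' tag and strip the tags from both ends.

    sample_tags: {sample_name: tag_sequence} (hypothetical tags)
    Returns (bins, unassigned), where bins maps sample_name to trimmed reads.
    """
    tag_to_sample = {tag: sample for sample, tag in sample_tags.items()}
    bins = {sample: [] for sample in sample_tags}
    unassigned = []
    for read in reads:
        head, tail = read[:tag_len], read[-tag_len:]
        sample = tag_to_sample.get(head)
        # Require the matching tag at both ends to guard against mis-tagged reads.
        if (
            sample is not None
            and len(read) > 2 * tag_len
            and tail == reverse_complement(head)
        ):
            bins[sample].append(read[tag_len:-tag_len])
        else:
            unassigned.append(read)
    return bins, unassigned


# Hypothetical 8-nt tags and toy reads.
tags = {"S1": "ACGTACGA", "S2": "TTGGCCAT"}
toy_reads = [
    "ACGTACGA" + "GGGGCCCC" + "TCGTACGT",  # S1 tag at both ends
    "TTGGCCAT" + "AAAATTTT" + "ATGGCCAA",  # S2 tag at both ends
    "ACGTACGA" + "GGGG" + "ATGGCCAA",      # mismatched tags
    "NNNNNNNNACGT",                         # no known tag
]
bins, unassigned = demultiplex(toy_reads, tags)
print({sample: len(seqs) for sample, seqs in bins.items()}, len(unassigned))
```

Requiring the tag at both ends is a conservative choice: it discards more reads but reduces the chance that a read is assigned to the wrong sample.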

Amplicon sequencing
A sequencing library of the final PCR mixture was constructed for the Ion Torrent platform using the NEBNext® Fast DNA Library Prep Set for Ion Torrent (New England BioLabs, USA), and the library was sequenced at the Maize Research Center, Beijing Academy of Agriculture and Forestry Sciences, on an Ion Torrent S5xl Chip400.

NGS data processing
Quality control and demultiplexing. NGS data quality control was carried out using the NGS QC Toolkit with the default parameters (Ravi et al., 2012). After quality control, the NGS data from the Ion Torrent S5xl were demultiplexed using FASTX-Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/) according to the sample labels and primers (

Competing interests
The authors declare that they have no competing interests.
BLAST (Altschul, 2012) against the NCBI database to assign scientific names to ZOTU sequences where possible.
Organism abundance. Each ZOTU sequence was used as a reference, and the reads from each soil sample were mapped to the references at a similarity of 0.97 using Usearch (Edgar, 2013). The number of reads matching each ZOTU was recorded as the abundance of that ZOTU (organism).
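The abundance step amounts to building a sample-by-ZOTU count table. The following is a simplified sketch of that bookkeeping, not the Usearch implementation: the names are hypothetical, similarity is computed without alignment on equal-length sequences, and each read is counted for its single best-matching ZOTU if that match reaches 0.97.

```python
def similarity(a: str, b: str) -> float:
    """Ungapped identity; sequences of unequal length score 0 in this toy version."""
    if not a or len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def abundance_table(reads_by_sample, zotus, threshold=0.97):
    """Count, per sample, the reads whose best ZOTU match reaches the threshold.

    reads_by_sample: {sample_name: [read, ...]}
    zotus:           {zotu_id: sequence}
    Returns {sample_name: {zotu_id: read_count}}.
    """
    table = {}
    for sample, reads in reads_by_sample.items():
        counts = {}
        for read in reads:
            best_id, best_score = None, 0.0
            for zotu_id, zotu_seq in zotus.items():
                score = similarity(read, zotu_seq)
                if score > best_score:
                    best_id, best_score = zotu_id, score
            # Reads below the threshold stay unmapped and are not counted.
            if best_id is not None and best_score >= threshold:
                counts[best_id] = counts.get(best_id, 0) + 1
        table[sample] = counts
    return table


# Toy references and reads (100 nt each).
toy_zotus = {"Z1": "A" * 100, "Z2": "C" * 100}
toy_reads_by_sample = {
    "pants": ["A" * 100, "A" * 98 + "CC", "C" * 100, "G" * 100],
    "site_a": ["C" * 98 + "AA"],
}
toy_table = abundance_table(toy_reads_by_sample, toy_zotus)
print(toy_table)
```

The read made only of G matches neither reference and is left out, which mirrors how reads from non-target organisms drop out of the abundance table.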

Suspect's mud tracking
The potential origins of the diatoms found in the mud from the suspect's pants were tracked to the 11 candidate soil samples by fast expectation-maximization for microbial source tracking (FEAST, Shenhav et al., 2019). FEAST is software developed for deducing the potential origin(s) of a microorganism community. It estimates the fraction of organisms contributed by each potential source and assigns the remainder to an unknown source, which helps to verify whether a candidate sample is the true source of the microorganism community in the mud from the suspect's pants. FEAST is currently implemented in R and is easy to run following the instructions online (https://github.com/cozygene/FEAST).

Diagram showing how rbcL fragments were labeled by PCR, pooled together and sequenced on the Ion Torrent S5xl platform.
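The source tracking idea can be illustrated with a much simplified expectation-maximization sketch. This is not the FEAST implementation and does not use its R interface: the function name and toy data are hypothetical, the source profiles are treated as fixed and known, and no unknown source is modelled, whereas FEAST additionally accounts for an unknown source and for uncertainty in the source profiles.

```python
def estimate_source_fractions(sink_counts, source_profiles, n_iter=200):
    """Estimate mixing fractions of known sources in a sink sample by EM.

    sink_counts:     list of read counts per taxon in the sink (e.g. the mud)
    source_profiles: {source_name: list of relative abundances per taxon}
    Returns {source_name: estimated fraction}, with fractions summing to 1.
    """
    names = list(source_profiles)
    alpha = {name: 1.0 / len(names) for name in names}  # uniform start
    for _ in range(n_iter):
        expected = {name: 0.0 for name in names}
        for j, count in enumerate(sink_counts):
            if count == 0:
                continue
            denom = sum(alpha[n] * source_profiles[n][j] for n in names)
            if denom == 0:
                continue  # taxon absent from every source: cannot be attributed
            # E-step: share this taxon's reads among sources by responsibility.
            for n in names:
                expected[n] += count * alpha[n] * source_profiles[n][j] / denom
        total = sum(expected.values())
        if total == 0:
            break
        # M-step: new fractions are the normalized expected read counts.
        alpha = {n: expected[n] / total for n in names}
    return alpha


# Toy example: two candidate sites with non-overlapping taxa.
toy_sources = {
    "site_a": [0.5, 0.5, 0.0, 0.0],
    "site_b": [0.0, 0.0, 0.5, 0.5],
}
toy_sink = [40, 40, 10, 10]
fractions = estimate_source_fractions(toy_sink, toy_sources)
print(fractions)
```

In this toy case the sink is attributed mostly to the first site. In real data, taxa shared between sites make the attribution less clear-cut, which is why an explicit unknown source is useful.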