The six analysis modules of NPIP are shown in Figure 1A and explained, in turn, below.
2.1 Quality control (-qc)
The NPIP pipeline includes a built-in quality control module for sequencing data, which can run quality control on an input file. In this step, NPIP utilises FASTQC (10) and MULTIQC (11).
2.2 Library demultiplexing module (-D)
NPIP can also first demultiplex the sequencing files into libraries and then perform quality control on the data of each sample separately. In this step, NPIP utilises the FASTQC (10), MULTIQC (11), and Qcat (12) tools.
2.3 Querying module (-list)
The pathogen database is built into the NPIP analysis process. Currently, the database includes full-length 16S sequences of 159 pathogenic bacteria in 60 genera and genomic sequences of 247 pathogenic bacteria. All data were checked and proofread manually. NPIP provides a module to query which pathogens are covered in the current version of the database. Through the NPIP List (NPIP_list.pl), a user can find or list the pathogens in the current database.
2.4 16S analysis module (-16S)
When NPIP was used to analyze the metagenomic sequencing data of a clinical sample prepared with a Nanopore 16S Barcoding Kit (SQK-RAB204), NPIP could directly demultiplex the sequencing data into libraries (using the -D parameter). With the -16S parameter, NPIP also performed quality control for individual samples using FASTQC (10), followed by sequence alignment against the built-in 16S database, retaining only the best matches. If the alignment length was less than 1200 bp or the similarity was less than 90% (the default parameters of NPIP), or if the alignment similarity was lower than the minimum similarity of the genus or strain, the sequence was not retained for follow-up analysis. The NPIP report listed all match results at the genus and species level in descending order, together with the number of matched reads, their percentage, and the species that may be confused with each result.
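The default filtering and ranking logic described above can be illustrated with a minimal Python sketch. It is not the NPIP implementation: the `Hit` record, the function names, and the tie-breaking rule (highest identity, then longest alignment) are assumptions made for illustration, and the genus- or strain-specific minimum similarity check is omitted.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Hit:
    """One alignment of a read against a 16S reference (hypothetical record)."""
    read_id: str
    genus: str
    species: str
    align_len: int    # alignment length in bp
    identity: float   # percent identity, 0-100


def best_hits(hits, min_len=1200, min_identity=90.0):
    """Drop hits below either threshold, then keep one best hit per read."""
    best = {}
    for h in hits:
        if h.align_len < min_len or h.identity < min_identity:
            continue  # fails the default length or similarity cut-off
        cur = best.get(h.read_id)
        # Assumed tie-break: higher identity first, then longer alignment.
        if cur is None or (h.identity, h.align_len) > (cur.identity, cur.align_len):
            best[h.read_id] = h
    return best


def genus_report(best):
    """Return (genus, read count, percentage) rows in descending order of count."""
    counts = Counter(h.genus for h in best.values())
    total = sum(counts.values())
    return [(g, n, 100.0 * n / total) for g, n in counts.most_common()]
```

A species-level table could be produced the same way by counting `h.species` instead of `h.genus`.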
To build the 16S database in NPIP, all sequences with confirmed taxonomic relationships and lengths ranging from 1200 to 2000 bp were selected from public 16S rDNA databases, including the Ribosomal Database Project (RDP) and the National Center for Biotechnology Information (NCBI), and from a database constructed from full-length 16S rDNA fragments that were extracted from the complete genome sequences of nearly 100 pathogens collected, cultured, and sequenced by our unit. The sequences in this 16S database, which cover the nine variable regions of 16S rDNA (Figure 1), were located based on the forward sequence (8F) 5′-AGAGTTTGATCCTGGCTCAG-3′ and the reverse sequence (1492R) 5′-GGTTACCTTGTTACGACTT-3′ (13) using a Perl script developed in our laboratory. To filter out low-quality or candidate problem 16S sequences, we compared all sequences individually and listed all alignment results with identity ≥80% using Blast (14) with the parameters (-e 0.1 -m 8 -b 1000). Sequences with different best-match results at the genus level, as well as those with an average identity <97% at the genus level, were defined as candidate problem sequences with erroneous taxonomic assignments and were filtered out of the 16S database. All of the sequences that were filtered out were also confirmed by a manual check through NCBI web-blast. To further ensure the credibility of the 16S database, any sequences with ambiguous taxonomy results at the species level, such as ‘Vibrio sp.,’ were also filtered out. Based on the “List of human pathogenic microorganisms” (released by China’s Department of Health in 2006), we obtained a list of disease-related microorganisms (DRMs), as described in detail in our previous paper (15). Finally, we obtained the 16S database embedded in the NPIP pipeline, which covers 10,075 sequences from 60 candidate DRM genera and 159 species.
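As an illustration of how a 16S region can be located with the 8F and 1492R primers, a minimal Python sketch is given here. It is not the in-house Perl script: exact string matching, the inclusion of both primer sites in the returned region, and the function names are assumptions, and a production script would likely tolerate mismatches and check both strands.

```python
FWD_8F = "AGAGTTTGATCCTGGCTCAG"      # forward primer, 5'->3'
REV_1492R = "GGTTACCTTGTTACGACTT"    # reverse primer, 5'->3'

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an unambiguous DNA string."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def extract_16s(seq, min_len=1200, max_len=2000):
    """Return the region from the 8F site to the 1492R site, or None.

    The reverse primer appears on the forward strand as its reverse
    complement, so that form is searched downstream of the 8F site.
    Regions outside the 1200-2000 bp window are rejected.
    """
    seq = seq.upper()
    start = seq.find(FWD_8F)
    if start == -1:
        return None
    rev_site = revcomp(REV_1492R)
    end = seq.find(rev_site, start + len(FWD_8F))
    if end == -1:
        return None
    region = seq[start:end + len(rev_site)]
    return region if min_len <= len(region) <= max_len else None
```

Applied to a complete genome, the same search would be repeated for each 16S copy; the sketch returns only the first region found.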
2.5 Genome analysis module (-G)
When NPIP was used to analyse the metagenomic sequencing data of a clinical sample, which could be generated by nanopore sequencing with the Ligation Sequencing Kit (SQK-LSK109), Rapid Sequencing Kit (SQK-RAD004), PCR Sequencing Kit (SQK-PSK004), or other kits, NPIP could screen pathogens directly from the sequencing data of a single sample. NPIP could also handle multiple samples in one run by first demultiplexing the sequencing data into libraries (using the -D parameter) and then screening the pathogens for each sample.
The genome database of pathogens is built into the NPIP pipeline. At present, 247 genomic sequences of DRMs are included in the database, based on the list of DRMs obtained previously (15). All data were checked and proofread by hand. Sequencing data were compared with each genome in the database using Bowtie 2 (16). The number of aligned sequences, the total alignment length, and the coverage were calculated from reads with an alignment length greater than 1000 bp and an identity >90%. Unique reads, defined as reads that matched only the target genome, were also counted and listed in the report.
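The per-genome statistics described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the NPIP code: alignments are represented as hypothetical `(start, end, identity)` tuples with 0-based, half-open reference coordinates, and coverage is taken to be the percentage of reference positions covered by at least one retained alignment.

```python
def summarize(alignments, genome_len, min_len=1000, min_identity=90.0):
    """Summarise alignments to one reference genome.

    alignments: iterable of (start, end, identity) tuples, where start/end
    are 0-based half-open reference coordinates and identity is in percent.
    """
    # Keep only alignments longer than min_len with identity above the cut-off.
    kept = [(s, e) for s, e, ident in alignments
            if (e - s) > min_len and ident > min_identity]
    total_len = sum(e - s for s, e in kept)

    # Merge overlapping intervals so shared positions are counted once.
    covered = 0
    cur_start = cur_end = None
    for s, e in sorted(kept):
        if cur_end is None or s > cur_end:
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = s, e
        else:
            cur_end = max(cur_end, e)
    if cur_end is not None:
        covered += cur_end - cur_start

    return {
        "reads": len(kept),
        "total_align_len": total_len,
        "coverage_pct": 100.0 * covered / genome_len,
    }
```

Dividing `total_align_len` by the genome length would give a mean depth under the same assumptions.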
2.6 Target pathogen analysis module (-T)
When a pathogen genome is not covered by the current version of the database, NPIP provides an analysis module for screening a specific pathogen. This module compares nanopore sequencing data against genome data provided by the user and directly generates the report page. The NPIP (-T) report covers the number of matched reads, the total alignment length, the match coverage (%), and the match depth.
2.7 The local version of NPIP
NPIP is available as a local version. Users can download the installation package (https://github.com/zhangwencdc/NPIP) and run it on a local Linux host.
2.8 Online version of NPIP
The online version of NPIP for 16S (Figure 1B) and target pathogen analysis modules has been released on the MDACP platform (https://analysis.mypathogen.org), which allows anyone with access to the Internet to perform pathogenic microbiological analyses without the need for local computing expertise and resources.