Build a Bioinformatics Analysis Platform and Apply it to Routine Analysis of Microbial Genomics and Comparative Genomics

Genomics and comparative genomics have been increasingly used as routine methods for general microbiological researches. However, it is usually necessary to call several tools or even write some scripts to complete some simple analysis, which is complicated for most biological researchers. To simplify the operation process, especially for the convenience of microbiologists in the analysis, here we have developed PGCGAP, a comprehensive, malleable and easily-installed prokaryotic genomics and comparative genomics analysis pipeline, which implements genome assembly, gene prediction and annotation, average nucleotide identity (ANI) calculation, phylogenetic analysis, COG annotation, pan-genome analysis, inference of orthologous gene groups, variants calling and annotation and screening for antimicrobial and virulence genes. Although we have tried our best to simplify the installation and usage of PGCGAP, it may be dicult for non-bioinformatician users to master it. So, a protocol was created to help microbiologists without any experience in bioinformatics to establish their own bioinformatics platform and perform routine analysis. This protocol shows how to choose equipment, to install a Linux subsystem on a laptop with windows 10 system, to install PGCGAP and perform all analysis with an example dataset. The protocol requires a basic understanding of Linux, so an additional web page was written to help uninitiated users learn Linux and whole-genome sequencing (http://bcam.hzau.edu.cn/linuxwgs.php).


Introduction
Genome sequencing has become a routine method for common microbiological studies as continuously decrease in the cost of genome sequencing. Various tools have been developed for genome analysis. But for general users, it takes time to install and learn to use various programs and prepare the related input les. Even for some simple purposes, users need to spend much effort to integrate several tools or even write some scripts. For example, when we need a core-genome-SNP based phylogenetic analysis for the isolates from same species, we should successively use the following tools, Bowtie2 1 or BWA 2 for reads mapping, Samtools 3 or GATK 4 for SNP calling, and FastTree 5 or RAxML 6 for phylogenetic tree construction. Therefore, a comprehensive, exible and e cient pipeline for general analysis is urgently needed. We developed a prokaryotic genomics and comparative genomics analysis pipeline named PGCGAP to coordinate several genomic analysis software packages and in-house scripts to meet the various needs of microbiologists. Applications of the protocol PGCGAP can be used for (i) genome assembly, (ii) gene prediction and annotation, (iii) genome distance estimation, (iv) phylogenetic analysis, (v) COG annotation, (vi) pan-genome analysis, (vii) inference of orthologous gene groups, (viii) variants calling and annotation and (ix) screening for antimicrobial and virulence genes. It is worth noting that although the entire pipeline was developed for prokaryotes, some of the modules such as "Assemble", "MASH", "OrthoF", "CoreTree" and "AntiRes" can also be used for the analysis of eukaryotic genomes. In addition, "VAR" is applicable to the analysis of any haploid genome.
Advantages and limitations of this pipeline PGCGAP is versatile, feature-rich, easy to install and use, and friendly to microbiologists and bioinformatics beginners. New features will continue to be added. But a Graphical User Interface (GUI) has not been developed.
Expertise required to implement the protocol Users need to be skilled in using computers, and it will be easier to master this protocol if they have some Linux skills. A webpage introducing the basics of Linux, usage of common commands, software installation, and the whole-genome sequencing technology was developed to help users get started with bioinformatics. Please visit https://github.com/liaochenlanruo/pgcgap/wiki/Learning-bioinformatics or http://bcam.hzau.edu.cn/linuxwgs.php for more information.
Overview of the procedure Ten frequently used prokaryotic genomics and comparative genomics analysis processes were integrated into PGCGAP as different modules. Modules can be used separately or in different combinations for various purposes ( Figure 1). (i) "VAR" performs genome-wide variants calling by mapping methods.
Firstly, paired-end reads were mapped to a reference genome by BWA 2 after ltered by Sickle 7 . Secondly, variants calling and annotation were performed by Freebayes 8 and snpEff 9 , respectively. Then, the whole genome SNP alignment and core SNP alignment were obtained by snippy-core 10 . Finally, Gubbins 11 was used to remove SNPs in uenced by recombination events of the whole genome SNP alignment. (ii) "Assemble" performs genome assembly using ABySS 12 or Canu 13 . (iii) "Annotate" performs gene prediction and genome annotation by Prokka 14 . (iv) "ANI" computes Average Nucleotide Identity (ANI) between each genome pair by fastANI 15 . Three scripts "triangle2list.pl", "get_ANImatrix.pl" and "Plot_ANIheatmap.R" have been developed here to generate the ANI matrix and plot the correlation matrix heat map (Supplementary Figure S1), respectively. (v) "MASH" estimates genome and metagenome distance and similarity using MinHash 16 , and a heat map of genome similarity will be generated by two scripts "get_Mash_Matrix.pl" and "Plot_MashHeatmap.R". (vi) "Pan" calls Roary 17 calculate the pangenome. Two scripts "fmplot.py" and "plot_3Dpie.R" were developed for result visualization (Supplementary Figure S2). A phylogenetic tree based on single-copy core proteins called by Roary 17 will be constructed (Supplementary Figure S3). (vii) COG (Clusters of Orthologous Group) annotation was conducted by module "pCOG". Amino acid sequences of each genome were blasted against the COG database, and then all hits were mapped to the COG functional category by in-house scripts. R script "Plot_COG.R" was written for result visualization (Supplementary Figure S4). Comparison and visualization of COG functional categories among different genomes can be done by Perl script "get_ ag_relative_abundances_table.pl" and R script "Plot_COG_Abundance.R" ( Supplementary Figure   S5). (viii) "OrthoF" uses OrthoFinder 18 for phylogenetic orthology inference. Gene duplication events will be also predicted (Supplementary Figure S6). (ix) "CoreTree" was developed for genome-wide phylogenetic analysis based on the protein sequences or SNPs of single-copy core genes. Firstly, CD-HIT 19 was used to rapidly generate protein clusters, and then the protein sequences of single-copy core genes were extracted by Perl scripts and aligned by MAFFT 20 . Secondly, on one hand, alignments of protein sequences were concatenated, and the phylogenetic tree was constructed by FastTree 5 (Supplementary Figure S7). On the other hand, the protein sequence alignments were converted into corresponding codon alignments by PAL2NAL v14 21 . Then, the alignments were concatenated, and SNPsites 22 was called to nd SNP sites. Finally, FastTree 5 was used to construct the SNPs phylogenetic tree (Supplementary Figure S8). (x) "AntiRes" calls abricate 23 to screen for antimicrobial and virulence genes from contigs.

Selection of reference genome format for variants calling
The reference genome can be les in FASTA format and GenBank format. If a GenBank le rather than a FASTA le was supplied as the reference, annotation information of the variants will be generated to show to the user which feature was affected by the variants.
Adjust the kmmer size for better assembly results According to our experience, when the length of Illumina reads is 150 bp and "--kmmer" was set from 81 to 89, better assembly quality can be obtained. It is better for the user to check the report le in the assembly result to ensure that the value of N50 is greater than 50,000 bp, otherwise, the user can try to change the value of "--kmmer" to improve the assembly quality.
Choice of a module to construct the phylogenetic tree of single-copy core proteins Both "CoreTree" and "Pan" can be used to construct a phylogenetic tree of single-copy core proteins. That which module should be used depends on which type of input le the user has. "CoreTree" takes only the amino acid sequence les as inputs, while "Pan" needs both amino acid sequence les and Gff3 les.
Choice of a module to calculate pairwise genome distance Both "ANI" and "MASH" can calculate pairwise genome distance. "MASH" is more suitable for dealing with thousands of genomes because of its faster running speed. In addition to nucleotide sequences and assembled genomes, "MASH" can also take amino acid sequences and raw sequencing reads as inputs and can be used to calculate distances between metagenomic samples. Equipment A laptop, desktop PC or server can be used to build a bioinformatics analysis platform, and the suggested hardware requirements are listed in Table 1. Slightly lower features are also allowed (CPU must have four logical processors, memory must be greater than 8 G), but the computing speed may decrease, and the capacity of the hard disk can be adjusted according to actual requirements.

Procedure
Building a bioinformatics analysis platform on Windows 10 Windows Subsystem for Linux (WSL) allows users to install Linux subsystems directly on Windows 10 system. It can easily run Linux commands and install Linux software, avoiding the installation of thirdparty virtual machine software. The advantage of WSL is that it makes better use of computer memory and does not require copying les between the host and the virtual machine. Press Enter when prompted to view the license, enter "yes" and press "Enter" to continue. Press "Enter" to con rm the installation location. Miniconda was installed in the miniconda3 directory under the user's home directory. Type "yes" and press "Enter" to initialize. Finally, run the "source ~/.bashrc" command in the terminal.
$source ~/.bashrc (iii) Set up Bioconda channel. Add the channels by entering the following three commands in the terminal. $conda con g --add channels defaults $conda con g --add channels bioconda $conda con g --add channels conda-forge  In this example, the working directory locates at the H drive. All hard disks in Windows were mounted in the "/mnt" directory of Ubuntu Linux. The "PGCGAP_Examples/Reads/Illumia" directory contains six Illumina Hiseq paired-end reads of Escherichia coli, the "PGCGAP_Examples/Reads/Oxford" directory contains the Oxford Nanopore reads of Escherichia coli K12, and the "PGCGAP_Examples/Reads/PacBio" directory contains the Paci c Biosciences released P6-C4 chemistry reads of Escherichia coli K12. "PGCGAP_Examples/Reads/MG1655" is the GenBank format le of E.coli K-12 substr. MG1655, used as the reference genome. Paired-end reads of six strains in the directory "Reads/Illumina/" were used as inputs. In the dataset, the naming format of the genome is "strain_1.fastq.gz" and "strain_2.fastq.gz". The string after the strain name is "_1.fastq.gz", and its length is 11, so "--su x_len" was set to 11. $pgcgap --Assemble --platform illumina --ReadsPath Reads/Illumina --reads1 _1.fastq.gz --reads2 _2.fastq.gz --kmmer 81 --threads 4 --su x_len 11 New directories and documents were generated after the program is nished. The assembly results for each genome are in the "Results/Assembles/Illumina" directory, and all scaffolds of the strains were stored in "Results/Assembles/Scaf/Illumina". Users are advised to check the assembly stats le (such as Results/Assembles/Illumina/SRR9620252_assembly/SRR9620252-stats.tab) of each genome to ensure that the value of N50 is greater than 50,000 bp. The le "scaf.list" under the working directory contains the absolute path of all genomes.

Example 2: Oxford reads assembly.
Oxford nanopore only produces one reads le ("Reads/Oxford/oxford.fasta"), so only the parameter of "--reads1" needs to be set, here the value is ".fasta". "--genomeSize" is the estimated genome size, and users can check the genome size of similar strains in the NCBI database for reference. The parameter was set to "4.8m" here. The su x of the reads le here is ".fasta" and its length is 6, so "--su x_len" was set to 6.

Example 3: PacBio reads assembly.
PacBio also produces only one reads le too ("Reads/PacBio/pacbio.fastq"), the parameter settings are similar to Oxford. The strain name is "pacbio" with the su x ".fastq" and the su x length is 6, so "-su x_len" was set to 6.

Example 4: Gene prediction and annotation.
Here, the assembly results of Illumina reads were taken as inputs ("Results/Assembles/Scaf/Illumina/*.fa"). The su x of the genome is "-8.fa". When running the program, the value of the "--Scaf_su x" parameter cannot be quoted. Here, -8.fa should not be quoted.
The phylogenetic trees of single-copy core proteins and single-copy core genes SNPs will be constructed using the six E. coli genomes sequenced by Illumina as datasets. The input les are the amino acid sequence les ("Results/Annotations/AAs/*.faa") and the nucleotide sequence les ("Results/Annotations/CDs/*.ffn") obtained by the genome annotation. Amino acid les and nucleotide les must be su xed with ".faa" and ".ffn", respectively. The ".faa" and ".ffn" les of the same strain should have the same pre x name. The name of protein IDs and gene IDs in the Amino acids le and nucleotide le should be started with the strain name. The Prokka 14 software was suggested to generate the input les.
Users can import these two les into MEGA 23 or iTOL 24 to view the topology.
21. Example 6: Constructing the single-copy core protein tree only.
If the "--CDsPath" was set to "NO", the nucleotide les will not be needed, and the phylogenetic tree of core SNPs will not be constructed too.
GFF3 les (With ".gff" as the su x) of each strain placed into a directory ("Results/Annotations/GFF/*.gff"). They must contain the nucleotide sequence at the end of the le. Protein sequence les (one per species) in FASTA format under another directory were also needed ("Results/Annotations/AAs/*.faa") if the parameter "--PanTree" was provided for constructing a phylogenetic tree. It should be noted that the "*.gff" le and the "*.faa" le must correspond. We strongly recommend using Prokka 14 to generate the les. If the "--Annotate" function was run rst, the les will be generated automatically.
The input le named "scaf.list" contains the absolute path of each genome, one per line. If the "--Assemble" function was run rst, the list le will be generated automatically. The value of the parameter "--Scaf_su x" depends on the actual situation, here is "-8.fa".
$pgcgap --ANI --threads 4 --queryL scaf.list --refL scaf.list --ANIO Results/ANI/ANIs --Scaf_su x -8.fa The results are stored in the "Results/ANI" directory. The le "ANI" contains comparison information of genome pairs. The document is composed of ve columns, each of which represents ANI value, count of bidirectional fragment mappings and total query fragments, respectively. A heat map le "ANI_matrix.pdf" was generated.
25. Example 10: Genome and metagenome similarity estimation using MinHash It takes genome les (complete or draft) in a directory as inputs (Default: Results/Assembles/Scaf/Illumina).
27. Example 12: Variants calling, and phylogenetic tree construction based on a reference genome.
The six genomes sequenced by Illumina were chosen as datasets ("Reads/Illumina/*.gz"). Taking Escherichia coli K-12 substr. MG1655 as the reference genome and the reference le "MG1655.gbff" in the GenBank format is stored in the "Reads" directory. The absolute path of the reference genome (here is "/mnt/h/PGCGAP_Examples/Reads/MG1655.gbff") is required to run the program.
28. Example 13: Screening of contigs for antimicrobial and virulence genes It takes genome les (complete or draft) in a directory as inputs (Default: Results/Assembles/Scaf/Illumina).
29. Example 14: Perform all functions for paired-end reads.
Only the reads le and reference le should be provided. For the sake of exibility, the "VAR" function needs to be added extra.

Troubleshooting
Troubleshooting advice can be found in Table 2.
Time Taken