The complete repertoire of interactions between a particular transcription factor (TF) with its target genes constitute a gene regulatory network (GRN). Since GRNs are critical in development and differentiation (and in disease research), identifying target genes of a TF, or the TF targeting a given gene, is important to understand the regulatory basis of biological processes. Characterization of GRN of a TF in a given biological context involves two major steps: 1) Identifying the chromatin occupancy of the TF in the genome 2) Correlation of the chromatin occupancy to the regulation of its target genes. Chromatin occupancy profile of a TF is obtained by fixing proteins onto the DNA using a chemical agent and selectively purifying the chromatin using antibodies followed by sequencing (ChIP-seq)(reviewed in Park, 2009). On the other hand, the correlation of chromatin occupancy to the regulation of a target gene by a TF comes from differential gene expression (DGE) analysis performed on transcriptomic data (RNA-seq) (reviewed in Wang, Gerstein and Snyder, 2009). Overlaying chromatin occupancy information of a TF with the regulation status of its target genes, combined with the precise location and abundance of the TF in target enhancer sequences, not only provides the gene regulatory network of the TF, but also sheds critical insights into the mechanisms of transcriptional regulation.
The analysis of ChIP-seq and RNA-seq data involves various steps performed by individual programs (Zhang et al., 2008; Li and Durbin, 2009; Li et al., 2009, 2009; Robinson, McCarthy and Smyth, 2010; Anders, Pyl and Huber, 2015; Kim et al., 2019). However, most of these tools require some prior knowledge of programming language or computational dependencies and requirements. While certain programs exist which provide a detailed analysis pipeline for ChIP and RNA seq datasets (Liu et al., 2011; Afgan et al., 2018; Batut et al., 2021), we wanted to build a software which will be fairly simple in terms of initial analysis, and therefore, more accessible to the user base with no programming expertise. In addition, while individual programs exist to calculate the occurrence and/or presence of transcription factors (Grant, Bailey and Noble, 2011), we could not find a software which could quantify and compare TF occupancy information between two or three given set of sequences. In this context, we have developed an easy to use tool which when deployed by the user, using a docker container on a Linux platform, installs the various open source packages required for analysis of ChIP-seq and RNA-seq data. While the parameters used for alignment and processing of bam files are set to default, the shell scripts for such processes can be easily changed to incorporate changes as required by the user. Further, we introduce an Analysis module which employs the FIMO software of Meme Suite to analyze and compare TF motif occupancy information in user provided genomic sequences. Taken together, we provide a versatile tool which can be employed by the larger user base for parsing of sequencing datasets and motif-based analysis of genomic sequences.
Implementation
Motifizer Architecture
We developed the Motifizer software with an aim to make it independent of the computational infrastructure and expertise. To maximize ease of use, the pipeline incorporates all the pre-requisite tools baked into a Docker image along with custom Motifizer scripts required for data processing. The pipeline ensures modularity and provides entry points to ChIP-seq and RNA-seq processes at multiple stages. The docker image can be built at the user’s end by cloning the Dockerfile available at (https://github.com/abhikbhattacharjee/Motifizer) and deploying it using a few simple commands (view Readme file).
All processes carried out by Motifizer are written as a combination of shell scripts and python scripts, thereby ensuring that each process is executed successfully inside an isolated environment using docker container. To reduce the complexity of working on a CLI, a web based interactive GUI is provided based on the Flask framework. The modular design of the program allows experienced users to obtain more conclusive insights by altering the parameters/flags in shell scripts according to their needs (Supplementary Information 1).
In terms of computational needs, Motifizer can be deployed on a single computing node or any platform where the underlying hardware has higher potential to obtain faster results by directly deploying the docker image. To reduce the time consumed by some of the heavy processes, parallel execution in our program using GNU Parallel (Tange, 2018) has been incorporated, which helps to reduce the processing time. The Motifizer software can be run on any engine capable of running Docker containers.
Motifizer Functionality
The Motifizer software is divided into three main modules; a) The ChIP-seq module b) The RNA-seq module c) The Analysis module (schematic overview provided in Fig. 1A, 1B, 1C). The ChIP-seq module was built using field-standard tools which involve alignment of the raw files to the reference genome using the BWA software (Li and Durbin, 2009), conversion and filtering of bam files using SAMtools (Li et al., 2009; Li, 2011), peak calling using the MACS2 software (Zhang et al., 2008) and peak annotation and de-novo motif analysis using the Homer software (Heinz et al., 2010). The ChIP-seq module is further divided into a) Alignment module b) Peak calling and annotation module. The Alignment module requires single ended fastq files (test and input) and the reference genome fasta as input, and provides the bam files, peak file, annotation of the peak file and de-novo motif enrichment analysis as output (Fig. 1A). The Peak calling and Annotation module, on the other hand, skips the alignment step and requires bam files as input. While all modules have been developed using default parameters, the shell scripts of the executable files can be easily edited to change and/or add additional parameters (Supplementary information 1).
The RNA-seq module also utilizes field standard software and performs read alignment using the Hisat2 software (Kim et al., 2019), filters out duplicate reads using SAMtools, performs read counting using the HTSeq software (Anders, Pyl and Huber, 2015), and employs the edgeR package (Robinson, McCarthy and Smyth, 2010) to determine differentially expressed genes. The RNA seq module is further divided into a) Alignment b) edgeR analysis. The Alignment module requires fastq files (either single or paired end) and the reference genome fasta file as inputs and provides read count files as output (Fig. 1B). The edgeR module requires count files as input and provides the list of differentially expressed genes as output (Fig. 1B). Similar to the ChIP-seq module, the RNA-seq module can be easily edited in a text editor before execution to change parameters (Supplementary Information 1).
One of the novel aspects of Motifizer is an Analysis module which can be used to quantify and compare the frequency of transcription factor motifs in user provided genomic sequences. The Analysis module employs the FIMO software (Grant, Bailey and Noble, 2011) to scan and count occurrences of transcription factor PWMs in genomic sequences and provides the number and frequency of occurrence as output. The Analysis module requires genomic coordinates (in bed format) in an excel format (Supplementary Excel File_Template) and can be used to compare up to three different categories of genomic sequences (provided in three different sheets) (Fig. 1C). The program filters out Motifs with a correlation greater than 0.6 and provides a number of files as output including the bed files and fasta sequence files in a folder named “Sequence_files”, and the frequency of each transcription factor and their comparison in each group of enhancer sequence as an excel file named as “Final result”. The final excel file contains the number of times a particular TF was found in each group of enhancer sequence (column headers “Enhancer_Group1”, “Enhancer_Group2”, “Enhancer_Group3”), the frequency of each TF in the given sequences (column headers “Enhancer_Group1_frequency”, “Enhancer_Group2_frequency”, “Enhancer_Group3_frequency”), and the comparative analysis (column headers “Enhancer_Group1/ Enhancer_Group2”, “Enhancer_Group2/ Enhancer_Group1”, “Enhancer_Group1/ Enhancer_Group3”, “Enhancer_Group2/ Enhancer_Group3”, “Enhancer_Group3/ Enhancer_Group1” and “Enhancer_Group3/ Enhancer_Group2”).