Data source, curation and collection
Metadata of ChIP-seq, DAP-seq, ATAC-seq/DNase-seq samples (equivalent to datasets, accession numbers start with SRX/ERX/DRX) and projects (start with SRP/ERP/DRP) were retrieved from NCBI SRA (https://www.ncbi.nlm.nih.gov/sra), BioSample (https://www.ncbi.nlm.nih.gov/biosample), BioProject (https://www.ncbi.nlm.nih.gov/bioproject) and/or GEO (https://www.ncbi.nlm.nih.gov/geo) databases. ChIP-Hub has a focus on data in “green plants” (i.e., only considering plants in the taxonomy tree with a root ID 33090). Only data generated by Illumina platforms were kept. Firstly, each dataset was associated with publication(s) if available (about 90% samples can be linked with publications). Then, each dataset was manually curated to determine its investigated factor (i.e., which TF or histone modification mark), its experimental type (whether ChIP or control) and its associated replicates (experiment may have several replicates), based on the metadata and the original publications. Note that it is important to manually check the metadata based on its corresponding publication since some metadata was misannotated in the database. For example, the dataset SRX4063234 in fact contains two different samples, one for ChIP experiment (SRR7142417) and another for control experiment (SRR7142416). In this case, “Run” accessions (start with SRR/ERR/DRR) were instead used as sample accessions (ca. 250 of such cases). For datasets without related publications so far, they were marked as a “unconfirmed” status and would be regularly checked in the future. In general, one experiment may contain replicate samples (i.e., datasets), ChIP sample(s) as well as input control sample(s) and it was designed to investigate regulation of a specific factor (e.g., TF or histone modification) of interest under specific conditions. In the analysis (see the section below), each experiment was processed independently. Furthermore, annotation information for investigated factors was also manually curated. Broadly, factors are grouped into “TFs and other proteins”, “histone-related” or “unclassified”. For TFs, their gene IDs and family information were also determined if applicable. Finally, a meta file was obtained for each experiment after curation (see Supplementary Fig. 17 for examples), which is served as an input file for the ChIP-seq computation pipeline (see below).
Raw fastq files for each experiment were downloaded from the European Nucleotide Archive (ENA, https://www.ebi.ac.uk/ena) database. If fastq files were not available at ENA, raw data in the SRA format were downloaded from the SRA database and converted into fastq format using the “fastq-dump” command provided by the SRA Toolkit (version 2.5.1). The “--split-files” option was used for paired-end reads. Fastq files were further checked for completeness before submitted to analysis.
Genome sequences and gene annotations were downloaded from public databases (Supplementary Table 8). Additional annotation data were also included in the ChIP-Hub database in order to better annotate the regulatory factors and their regulatory networks. Annotation for miRNA genes were obtained from miRbase65 and their genomic locations were updated (by BLAST) based on current reference genomes. TF family information was retrieved from PlantTFDB66. TF DNA binding motifs were downloaded from the JASPAR67, CIS-BP68 and PlantTFDB66 databases and were scanned for occurrences in the genome using FIMO69. These data were provided as separated data tracks in the genome browser.
Data processing
We followed the ChIP-seq data analysis guidelines10 recommended by the ENCODE project to develop computational pipeline for various regulome data analysis (Fig. 1e). The analysis pipeline consists of quality control, read mapping, peak calling and assessment of reproducibility among biological replicates and was used to analyze all annotated experiments a standardized and uniform manner. Specifically, potential adapter sequences were removed from the sequencing reads using the Trim Galore program (version 0.4.1) and the quality of sequencing data was then evaluated by FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/). Trimmed reads were mapped to the corresponding reference genomes using Bowtie2 70 (version 2.2.6) with parameters “-q --no-unal --threads 8 --sensitive”. The parameter “-k” was set to 1, 2 and 3 for diploid genomes (e.g., Oryza sativa), tetraploid genomes (e.g., Gossypium barbadense) and hexaploidy genomes (e.g., Triticum aestivum), respectively. Redundant reads and PCR duplicates were removed using Picard tools (v2.60; http://broadinstitute.github.io/picard/) and SAMtools71 (version 0.1.19).
Peak calling was performed using MACS2 72 (version 2.1.0). Duplicated reads were not considered (“--keep-dup=1”) during peak calling in order to achieve a better specificity73. The shifting size (“--shift”) used in the model was determined by the analysis of cross-correlation scores using the phantompeakqualtools package (https://code.google.com/p/phantompeakqualtools/). The parameter “--call-summits” was used to call narrow peaks. For broad marks of histone modifications (including H3K36me3, H3K20me1, H3K4me1, H3K79me2, H3K79me3, H3K27me3, H3K9me3 and H3K9me1), broad peaks were also called by turning on the “--broad” parameter in MACS2. A relaxed threshold of p-value (p-value < 1e-2) was used in order to enable the correct computation of IDR (irreproducible discovery rate) values10, because IDR requires input peak data across the entire spectrum of high confidence (signal) and low confidence (noise) so that a bivariate model can be fitted to separate signal from noise30. Following the recommendations for the analysis of self-consistency and reproducibility between replicates30, replicate control samples (if available) were combined into one single control in the same experiment. Peak calling was applied to all replicates, pooled data (pooled replicates), pseudo-replicates (half subsample of reads) of each replicate and the pseudo-replicates of pooled sample using the same merged control as input (if applicable). By default, “reproducible” peaks across pseudo-replicates and true replicates with an IDR < 0.05 were recommend for analysis. Besides, peaks with different statistical thresholds are available upon request. For example, “significant” peaks were defined as a fold-change (fold enrichment above background) > 2 and a -log10 (q-value) > 3; while “lenient” peaks as a fold-change > 2 and a -log10 (q-value) > 2. “Relaxed” peaks without additional thresholding were also provided so that any custom threshold can be applied. All peak-based analyses in the pipeline (including peak overlapping, merging and summary) were performed using BEDTools 74 (v2.25.0).
Various metric scores were calculated to assess different aspects of the quality of experiments (https://genome.ucsc.edu/ENCODE/qualityMetrics.html and https://www.encodeproject.org/data-standards/terms/; Fig. 2c,d and Supplementary Fig. 5). For example, library complexity is measured using the non-redundant fraction (NRF) and PCR bottlenecking coefficients 1 and 2 (PBC1 and PBC2). The SPOT (signal portion of tags) score, characterizing the enrichment of signal for each experiment, was calculated by the Hotspot75 algorithm by subsampling ten million reads. Fraction of reads in peaks (FRiP), another measure of enrichment, is highly correlated with the SPOT score (Supplementary Fig. 5). NSC and RSC (normalized and relative strand cross-correlation coefficient) are related measures of enrichment without dependence on pre-defined peaks, which were calculated by the phantompeakqualtools program76.
For visualization purpose, wiggle tracks (using pooled data across replicates) were generated by DeepTools77 with the “bamCoverage” program; different normalization methods (including RPKM [reads per kilobase per million mapped reads], CPM [counts per million mapped reads], BPM [bins per million mapped reads], RPGC [reads per genomic content normalized to 1x sequencing depth] and None) were used to generate different types of signal files. Data signal tracks were visualized in the JBrowse78 or the WashU Epigenome Browser79.
Annotation of promoters and enhancers
We adopted the same approach in our previous study41 to predict Arabidopsis promoters and enhancers.
Assignment of target genes
Regulatory elements (in layman's terms, called “peaks”) were assigned to putative target genes based on the following rules. For a regulatory region overlapping with any gene(s) (protein-coding genes or miRNAs), the overlapping gene(s) were considered as its targets. Otherwise, the regulatory element was assigned to its nearest annotated gene within up to N bp, where N is the median size of intergenic regions (N was set to 3000 if the median size exceeded 3000). The start of genes (i.e., the transcription start site [TSS] of protein-coding genes and the 5’ end of miRNA precursors [pre-miRNAs]) was used to calculate the distance. In general, this approach associates a single regulatory element with no more than two genes, with a few exceptions in the case of the regulatory element overlapping multiple genes. This procedure was performed in each species independently.
Chromatin state analysis
In order to use the collected histone modification ChIP-seq data from diverse studies for chromatin state analysis and to make the data more comparative among different plant species, only well-characterized H3-related histone modification marks (including H3K9ac, H3K27ac, H3K4me1/2/3, H3K9me1/2/3, H3K27me1/2/3 and H3K36me1/2/3) were considered and only data generated in wild-type plants were used. Furthermore, the datasets were broadly categorized into vegetative-, reproductive- and root-related samples based on their tissue specificity (Supplementary Table 7). In general, these broadly defined “tissue” types (termed reference tissue types) are more comparative among different plant species and difference in tissue collection by different studies can be eliminated. Although the analysis is cell type agnostic, it is informative even when the relevant cell or tissue type has not been experimentally profiled (this is the most case in plants so far). In addition, we filtered out experiments with less than 1000 called peaks and only considered plant species with at least five distinct types of histone modification marks. In summary, 251 experiments from five plant species were retained for chromatin state analysis (Supplementary Table 7 and Supplementary Fig. 10).
ChromHMM45 (version 1.19) was applied on the ChIP-seq data of histone modifications in three reference tissue types in five plant species to learn a multivariate HMM model for segmentation of genome in each tissue type. Specifically, the called peaks were first pooled from different ChIP-seq experiments for each type of the histone modifications in each tissue type for each genome separately. Peaks within blacklist regions were excluded from the analysis. The remaining pooled peaks were then processed by the “BinarizeBed” command (with the parameter “-peaks”) into binarized data in every 200 bp window over the entire genome. Models were trained independently for each reference tissue type in each genome since the composition of marks varied in different tissue types. We ran the “LearnModel” command with the number of states ranging from 2 states to 15 states and selected an “optimal-state” model based on a rule that the number of states appeared most parsimonious in terms of clearly distinct emission properties and clear interpretability of distinction between states (Supplementary Figs. 12-16). Furthermore, the resulting chromatin states were interpreted based on enrichment analysis of various types of functional annotations, such as gene elements, neighboring gene expression pattern, TF binding, chromatin accessibility and predicted enhancers41,80. To this end, the “OverlapEnrichment” and “NeighborhoodEnrichment” commands were used in the analysis. The meaningful mnemonics of states for Arabidopsis vegetative-related tissues was given in Fig. 6d.
Comparative genomics and cross-species comparisons
Whole-genome alignments were performed in a similar way as previously described51. Briefly, soft masked genomes were aligned to each other using the LastZ alignment algorithm81. Collinear alignment blocks separated by gaps of <100 kb were then “chained” according to their locations in both genomes and “netted” to choose the best sub-chain for the reference species82. For polyploid plants, each sub-genome was individually analyzed such that each contained non-overlapping chaining. The whole-genome alignments can be visualized together with epigenomic tracks through the integrated Epigenome Browser (see below).
Pairwise comparisons of chromatin states were performed by one-to-one mapping annotated regions between species based on the above whole-genome alignments. For regions mapped to multiple orthologous locations in the other genome (i.e., regions split over multiple alignment blocks), only the largest orthologous region in the same alignment block was considered. Marked regions were considered as conserved between species when their orthologous location in the second species overlapped a marked region by a minimum of 50%. Note that the minimum required overlap had little influence on the overall results given that the median value of overlaps is 100% and the mean value is 89.9%. To make state interpretations more comparable across different species (chromatin marks available for state prediction were slightly different among species, see Supplementary Figs. 12-16), the learned chromatin states were re-interpreted (Fig. 6g,h) based on a common set of marks as possible (Supplementary Fig. 18).
Analysis of gene regulatory networks
To study gene regulatory networks (GRNs) controlled by TFs with available ChIP-seq data, we focused on a specific network motif, TF-miRNA-TF feed-forward loops (FFLs), which involves targeting of a TF to both miRNAs and miRNA target TFs. Such trifurcate regulatory circuits are of importance for fine tuning of downstream gene expression83,84. We highlighted the analysis on Arabidopsis data since a comprehensive list of TFs have been investigated by ChIP-seq experiments in this plant species (Fig. 2b). The methodology, however, can easily be applied to data from any other plants when more and more data are generated. In the miRNA-mediated FFLs, target genes of miRNAs were predicted by the TargetFinder tool85, with a prediction score cut-off value set to 4. Other relationships (i.e., TF-miRNA and TF-TF) were supported by ChIP-seq data. The final meta-network consisted of regulatory relationships among 117 master TFs, 134 miRNAs and 462 common target TFs (Fig. 3d and Supplementary Table 3), covering nearly two-thirds of the predicted FFLs involved in flower development 31 (Fig. 3e).
Convolution neural network analysis
To identify sequence motifs enriched in dynamically accessible regions among different tissues we used Basset42, a convolutional neural network (CNN) approach. We set the input to the CNN as the 600 bp sequences centered at the summit of the top 2500 highly accessible peaks for each tissue. The output of the classifier is a binary vector of length 10 (i.e., the number of tissue types). We used default Basset values for most parameters, except that we set the number of first layer filters to 600. The CNN contained three convolutional layers, each followed by a rectified linear unit (ReLU) and a max pooling layer, and two fully connected layers. The network architecture is schematically shown in Supplementary Fig. 9c. The predictive performance of the networks was assessed by the basset_test.lua script in Basset.
ChIP-Hub Shiny application
In order to efficiently use our reanalyzed data by external users, we developed an integrative web-based application (ChIP-Hub) with the Shiny framework (http://shiny.rstudio.com/), which combines the computational power of R with friendly and interactive web interfaces (Supplementary Fig. 2). All the sample metadata, curated metadata and analyzed result data were loaded into a MySQL database, allowing for interactive retrieval through the ChIP-Hub interface. These data were presented in tabular and chart forms in our Shiny web application. Furthermore, the data can be searched by keyword or gene to select datasets of interests. The associated result files, such as wiggle signal files, peak files and additional annotation files, can be loaded into the integrated Epigenome Browser (https://compbio.nju.edu.cn/browser/) for visualization.
Online access and updates
To make this project easier to maintain for a long life and to update in time, we have developed a semi-automatic computational program (ChIPer) for this purpose. The program regularly (in very month according to our current plan) checks whether any new datasets available in public databases. If so, the new datasets will be sent for curation via email and the curated datasets will be automatically analyzed by the data processing pipeline. New result files will be checked and uploaded to our web server quarterly. Besides, we will include more functionalities in our Shiny application as required.
Statistical analysis
If not specified, all statistical analyses and data visualization were done in R (version 3.4.1). R packages such as ggplot2 and plotly were heavily used for graphics. All the sources data for each figure can be found in the Supplemental Information and the newest data can be found in our ChIP-Hub website.
Code and data availability
The data can be viewed, mined and downloaded through the ChIP-Hub website (http://www.chip-hub.org). ChIP-Hub can also be available at https://biobigdata.nju.edu.cn/ChIPHub/.