Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) provides quantitative, genome-wide mapping of target protein binding events. Identifying putative protein binding sites from these large, sequence-based datasets presents a bioinformatic challenge that has required considerable computational innovation, despite the availability of numerous programs for ChIP-chip analysis. With the rising popularity of ChIP-seq, demand for new analytical methods has led to a proliferation of peak finding algorithms. Reviewing literature from the past three years, we noted 31 open-source programs for finding peaks in ChIP-seq data (Table S1), in addition to the available commercial software. The sheer abundance of software packages, and the technical variability with which they identify protein binding sites, makes an assessment of the current methods timely. An appraisal of available analytical methods will better equip researchers to bridge the “next-generation gap” between sequencing and data analysis.
Recently, Pepke et al. published a review of the major steps in ChIP-seq analysis and detailed the algorithmic approaches of 12 available programs for detecting peaks (the signals of putative protein binding) from ChIP-seq data. For clarity, we provide a brief overview of the main algorithmic treatments of ChIP-seq data; however, our focus here is evaluative rather than purely descriptive. The purpose of this study is to provide an impartial analysis that helps readers navigate the myriad of options. Laajala et al. provide some metrics for evaluating different methods, but leave many areas unexplored. Our work offers several improved ways to assess algorithm performance and addresses the question: which of the available methods for ChIP-seq analysis should I consider using?
The ChIP protocol ideally produces a pool of DNA fragments significantly enriched for the target protein's binding sites. High-throughput sequencing of these fragments generates millions of short sequence ‘tags’ (generally 20 to 50 bp in length) that are subsequently mapped back to the reference genome. By recognizing genomic regions with increased sequence coverage, ChIP-seq experiments identify the coordinates of protein binding events. ChIP-seq peak finders must discriminate these true peaks in sequence coverage, which represent protein binding sites, from the background signal.
When examining tag density across the genome, it is important to consider that sequence tags represent only the 5′-most end of the original fragment, due to the inherent 5′ to 3′ nature of the current generation of short-read sequencing instruments. This results in a strand-dependent bimodality in tag density that is most evident in sequence-specific binding events, such as the binding of transcription factors to cis-regulatory elements (Figure 1). Most programs perform some adjustment of the sequence tags to better represent the original DNA fragment, either by shifting tags in the 3′ direction or by extending tags to the estimated length of the original fragments. When the average fragment length can be accurately inferred (either computationally or empirically), the combined density forms a single peak whose summit corresponds closely to the binding site. If paired-end sequencing is used, the fragment length can be measured directly, allowing more precise determination of binding sites; this feature is currently supported by only a handful of peak calling algorithms.
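To make the tag-adjustment step concrete, a minimal sketch is given below. This is an illustrative toy implementation in Python, not the code of any program evaluated here; the function name, coordinate conventions (0-based, half-open) and parameters are our own.

```python
def adjust_tag(start, strand, tag_len, frag_len, mode="extend"):
    """Approximate the original DNA fragment from a 5'-anchored sequence tag.

    'shift' moves the tag part of the fragment length in the 3' direction;
    'extend' stretches it to the estimated fragment length.
    Coordinates are 0-based half-open; frag_len is an inferred estimate.
    """
    if mode == "shift":
        offset = frag_len // 2
        if strand == "+":
            return (start + offset, start + offset + tag_len)
        return (start - offset, start - offset + tag_len)
    # extend: stretch the tag in the 3' direction to frag_len
    if strand == "+":
        return (start, start + frag_len)
    # minus-strand tag: its 5' end is the rightmost base (start + tag_len)
    end = start + tag_len
    return (end - frag_len, end)

# A plus/minus tag pair flanking a binding site converges after extension:
plus = adjust_tag(1000, "+", 36, 200)   # covers (1000, 1200)
minus = adjust_tag(1164, "-", 36, 200)  # 5' end at 1200, also covers (1000, 1200)
```

After extension, the two stranded densities merge into a single peak over the binding site, which is the effect described above.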
Figure 1. Strand-dependent bimodality in tag density.
The 5′ to 3′ sequencing requirement and short read length produce a stranded bias in tag distribution. The shaded blue oval represents the protein of interest bound to DNA (solid black lines). Wavy lines represent either sense (blue) or antisense (red) DNA fragments from ChIP enrichment. The thicker portion of each line indicates the region sequenced by short-read sequencing technologies. Sequenced tags are aligned to a reference genome and projected onto a chromosomal coordinate (red and blue arrows). (A) Sequence-specific binding events (e.g. transcription factors) are characterized by “punctate enrichment” and defined strand-dependent bimodality, where the separation between peaks (d) corresponds to the average sequenced fragment length. Panel A was inspired by Jothi et al. (B) Distributed binding events (e.g. histones or RNA polymerase) produce a broader pattern of tag enrichment that results in a less defined bimodal pattern.
The first step in peak finding is to identify genomic regions with large numbers of mapped sequence tags. One approach is to identify regions where extended sequence tags (XSETs) either overlap or fall within some fixed clustering distance. Another commonly used method for finding enriched regions counts the tags found in fixed-width windows across the genome, an approach known as a sliding window algorithm. Because this histogram-type density estimator can produce edge effects that depend on the window or bin size, some programs instead employ a Gaussian kernel density estimator (G-KDE) that generates a continuous coverage estimate. All these methods specify some minimum height criterion at which enrichment is considered significant, and some minimum spacing at which adjoining windows, clusters or local maxima (G-KDE) are merged into a single peak region.
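The contrast between the two density estimators can be sketched in a few lines. The following is a simplified illustration only (single chromosome, unit-weight tags, unnormalized kernel); real implementations differ in normalization and efficiency.

```python
import math

def window_counts(tag_positions, genome_len, window):
    """Histogram-style density: tag counts in fixed, non-overlapping windows."""
    bins = [0] * ((genome_len + window - 1) // window)
    for p in tag_positions:
        bins[p // window] += 1
    return bins

def gaussian_kde(tag_positions, genome_len, bandwidth):
    """Continuous coverage estimate: each tag contributes a Gaussian bump,
    truncated at 3 bandwidths for speed (unnormalized, for illustration)."""
    density = [0.0] * genome_len
    for p in tag_positions:
        lo = max(0, p - 3 * bandwidth)
        hi = min(genome_len, p + 3 * bandwidth + 1)
        for x in range(lo, hi):
            density[x] += math.exp(-((x - p) ** 2) / (2 * bandwidth ** 2))
    return density

tags = [100, 102, 105, 400]
counts = window_counts(tags, 500, 50)  # window 2 (positions 100-149) holds 3 tags
dens = gaussian_kde(tags, 500, 10)     # smooth local maximum near position 102
```

The windowed counts jump at bin boundaries (the edge effect noted above), whereas the G-KDE yields a continuous profile whose local maxima can serve as peak candidates.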
Rather than searching for peaks in coverage, several methods leverage the bimodal pattern in the strand-specific tag densities to identify protein binding sites, either as their main scoring method or in an optional post-processing filtering step. Programs that use this signal exclusively, which we call “directional scoring methods,” are more appropriate for proteins that bind to specific sites (e.g. transcription factors) than for more distributed binders, such as histones or RNA polymerase (Figure 1B).
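One simple way to exploit the stranded bimodality is to find the strand shift that best aligns the plus- and minus-strand tag profiles; the best shift approximates the mean fragment length d in Figure 1A. The sketch below uses exact position matching as a crude stand-in for the cross-correlation used by real directional scoring methods, and is our own illustration rather than any published algorithm.

```python
def strand_shift_score(plus_tags, minus_tags, shift):
    """Count minus-strand tags that land on a plus-strand tag position
    after being shifted `shift` bp upstream (toy cross-correlation)."""
    plus = set(plus_tags)
    return sum(1 for m in minus_tags if (m - shift) in plus)

def estimate_fragment_length(plus_tags, minus_tags, max_shift=300):
    """The shift maximizing the alignment score approximates the mean
    sequenced fragment length d (the peak separation in Figure 1A)."""
    return max(range(max_shift + 1),
               key=lambda s: strand_shift_score(plus_tags, minus_tags, s))

# Simulated site: plus tags at the 5' edge, minus tags d = 150 bp downstream
plus = [1000, 1003, 1007]
minus = [1150, 1153, 1157]
d = estimate_fragment_length(plus, minus)  # recovers the 150 bp separation
```

For broad binders (Figure 1B) the two strand profiles blur together, and no single shift aligns them sharply, which is why directional scoring degrades on such data.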
CSDeconv, a recently published algorithm, combines G-KDE and directional information with a deconvolution approach that enables detection of closely spaced binding sites. This approach has been shown to offer higher spatial resolution, though its intense computational demands limit the size of genomes that can be analyzed. Developed expressly for use on a bacterial genome, CSDeconv and programs like it may represent an excellent choice for microbial ChIP-seq experiments, with their few binding sites, small genome sizes and high sequence coverage.
More specialized programs have also been developed for the analysis of RNA polymerase and epigenetic modification ChIP-seq data. These targets bind DNA over larger regions, producing relatively broad, low-intensity peaks that can be difficult to detect. Though we focus on identifying transcription factor binding sites from ChIP-seq data, we mention these additional methods should readers find them appropriate for their specific experiments.
Peak finding programs must determine the number of tags (peak height) or directionality score that constitutes “significant” enrichment likely to represent a protein binding site. An ad hoc way of dealing with this issue is simply to let users select some threshold value to define a peak. However, this simplistic approach does little to help the user assess the significance of peaks and is prone to error. More sophisticated methods assess the significance of sequence tag enrichment relative to the null hypothesis that tags are randomly distributed throughout the genome. This background has previously been described using either a Poisson or a negative binomial model parameterized from the coverage of low-density regions in the ChIP sample. The actual background signal, however, shows decidedly non-random patterns and is only poorly modeled by these methods, which have been shown to systematically underestimate false discovery rates.
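Under the random-background null, the p-value of a window is just the upper tail of a Poisson distribution whose rate is estimated from low-density regions. A self-contained sketch of that calculation (our own illustration, using only the standard library):

```python
import math

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam): the upper-tail p-value of seeing
    k or more tags in a window if tags fell at random with rate lam."""
    # accumulate P(X = i) for i < k, then take the complement
    cdf = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))
    return 1.0 - cdf

# With a background of 2 tags/window on average, observing 10 tags in one
# window is very unlikely under the null (p on the order of 5e-5)
p = poisson_sf(10, 2.0)
```

The systematic problem noted above is that the real background is overdispersed and locally structured, so this null assigns smaller p-values than the data warrant, underestimating false discovery rates.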
To account for the complex features of the background signal, many methods incorporate sequence data from a control dataset generated from fixed chromatin or from DNA immunoprecipitated with a nonspecific antibody. Control data can be used to adjust the ChIP tag density prior to peak calling. Some methods implement background subtraction by calling peaks from the difference between the ChIP and normalized control tag densities, while others use control data to identify and compensate for large duplications or deletions in the genome.
Control tag densities are also used to assess the significance of peaks in the ChIP sample. One straightforward approach is to calculate the fold enrichment of ChIP tags over normalized control tags in candidate regions, to account for the fluctuating background signal. Greater statistical sophistication can be achieved by employing models parameterized from the normalized control sample to assess the significance of ChIP peaks. Different programs have implemented models of varying complexity, such as Poisson, local Poisson, t-distribution, conditional binomial and hidden Markov models. These statistical models are used primarily to assign each putative peak some significance metric, such as a P-value, q-value, t-value or posterior probability. Control data can also be used to calculate empirical false discovery rates by counting the peaks found in the control data (FDR = # control peaks / # ChIP peaks). Peaks are identified in the control data either by swapping the ChIP and control samples or, if enough control sequence is available, by partitioning the control data. The goal of all these methods is to provide more rigorous filtering of false positives and accurate ranking of high confidence peak calls.
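The two simplest control-based statistics described here, fold enrichment with library-size normalization and the empirical FDR, can be written down directly. This is a minimal sketch (the pseudocount is our own choice to avoid division by zero, not a convention from any particular program):

```python
def fold_enrichment(chip_count, control_count, chip_total, control_total):
    """ChIP tag count in a region over the control count, after scaling
    the control to the ChIP library size; +1 pseudocount avoids /0."""
    scaled_control = control_count * (chip_total / control_total)
    return chip_count / (scaled_control + 1.0)

def empirical_fdr(n_control_peaks, n_chip_peaks):
    """Empirical FDR from a sample swap: peaks called in the control
    divided by peaks called in the ChIP sample at the same threshold."""
    return n_control_peaks / n_chip_peaks

# Control sequenced twice as deeply: 10 control tags scale to 5, so a
# region with 120 ChIP tags is 20-fold enriched
enr = fold_enrichment(120, 10, 10_000_000, 20_000_000)
fdr = empirical_fdr(15, 1800)  # 15 "peaks" in swapped control vs 1800 in ChIP
```

Both quantities are per-threshold: tightening the peak-calling threshold changes the numerator and denominator of the empirical FDR together.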
In this work, eleven peak calling algorithms are benchmarked against three empirical datasets from transcription factor ChIP-seq experiments. Our goal was to provide quantitative metrics for comparing the available analysis programs based on the similarity of the peaks called, sensitivity, specificity and positional accuracy. We find that many programs call similar peaks, though their default parameters are tuned to different levels of stringency. While the sensitivity and specificity of the different programs are quite similar, larger differences emerge in the positional accuracy of the predicted binding sites.
Peak calling programs employ a wide variety of algorithms to search for protein binding sites in ChIP-seq data; however, it remains unclear to what extent these differences in methodology and mathematical sophistication translate into substantial variation in performance. Definitively benchmarking the performance of different peak calling programs is challenging, since no comprehensive list exists of all genomic locations bound by the target under the experimental conditions (true positives). In lieu of empirical data, an in silico “spike-in” dataset can be generated by adding a known number of simulated ChIP peaks to control sequence. However, such methods remain relatively unreliable due to the challenge of mimicking the form and variability of empirical ChIP peaks.
We chose to test programs against three published transcription factor ChIP-seq datasets with controls: human neuron-restrictive silencer factor (NRSF), growth-associated binding protein (GABP) and hepatocyte nuclear factor 3α (FoxA1). Each of these transcription factors has a well-defined canonical binding motif (see Materials and Methods) that can be used to assess ChIP-seq peak quality and confidence. NRSF represents a particularly attractive test case, as its 21 bp canonical binding motif, NRSE2, has been rigorously defined and has relatively high information content compared to the shorter GABP (12 bp) and FoxA1 (10 bp) motifs. For further validation, we also make use of the extensive lists of qPCR-verified sites available for NRSF (83 sites) and GABP (150 sites) (available online as Dataset S1). While the empirical ChIP-seq datasets analyzed herein do not address interesting issues concerning biological replicates, we feel that this facet of ChIP-seq analysis has been studied expertly in previous publications.
Eleven peak calling methods capable of using control data were selected from the available open-source programs to represent the diversity of approaches at the different peak calling stages (Figure 2). To best approximate typical implementation by non-expert users, all programs were run with their default or recommended settings on the same desktop machine equipped with 4 GB of RAM. While some programs have many tunable parameters, we forgo extensive parameter optimization, which might have improved the results for some methods on the NRSF data, as this task is beyond the reach of most users.
Figure 2. ChIP-seq peak calling programs selected for evaluation.
Open-source programs capable of using control data were selected for testing based on the diversity of their algorithmic approaches and their general usability. The common features of the different algorithms are summarized and grouped by their role in the peak calling procedure (colored blocks). Programs are categorized by the features they use (Xs) to call peaks from ChIP-seq data. The version evaluated in this analysis is shown for each program, as feature lists can change with program updates.
For each of the three datasets, every peak caller reported a different number of peaks (Figure 3). The variation in the quantity of identified peaks indicates that default stringency levels are tuned differently among programs. A core set of peaks shared by all eleven programs was identified and found to comprise 75–80% of the smallest peak list for each ChIP-seq dataset (Figure 3). This core set suggests that smaller peak lists may, by and large, simply represent subsets of the peaks called by programs with less stringent default parameters. Previous comparisons have offered only qualitative insights by examining the average overlap of each peak list with the other methods. To address this question more rigorously, we conducted a series of pair-wise comparisons between the peak lists from each method to determine which peaks were shared. These comparisons are presented in Figure 4 as the percentage of each peak list (column) shared with another method (row). For all three datasets, a smaller peak list shared an average of 92% of its peaks with a larger peak list from a different method, whereas larger peak lists shared an average of only 45–55% of their peaks with smaller peak lists. These figures indicate that the more stringent peak lists from some programs are nearly completely contained within the larger sets of calls made by other methods, consistent with the more general findings of Laajala et al.
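The asymmetry of these pair-wise comparisons follows directly from how shared peaks are counted. A minimal sketch of the overlap calculation (intervals on a single chromosome, any base-pair overlap counts; real comparisons would group by chromosome and may require minimum overlap):

```python
def shared_fraction(list_a, list_b):
    """Fraction of peaks in list_a overlapping at least one peak in
    list_b; peaks are (start, end) intervals on one chromosome."""
    def overlaps(p, q):
        return p[0] < q[1] and q[0] < p[1]
    hits = sum(1 for a in list_a if any(overlaps(a, b) for b in list_b))
    return hits / len(list_a)

# A stringent list nested inside a permissive one is fully "shared" in
# one direction but only partially in the other:
stringent = [(100, 300), (1000, 1200)]
permissive = [(120, 280), (1050, 1150), (5000, 5200), (9000, 9300)]
frac_small_in_large = shared_fraction(stringent, permissive)  # all shared
frac_large_in_small = shared_fraction(permissive, stringent)  # half shared
```

This is why Figure 4 is not symmetric: the denominator is the peak list in the column, so nesting of stringent lists inside permissive ones shows up as high values in one triangle and low values in the other.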
Figure 3. Quantity of peaks identified.
Programs report different numbers of peaks when run with their default or recommended settings on the same dataset. The number of reported peaks is shown for the GABP (green bars), FoxA1 (red bars) and NRSF (blue bars) datasets. To assess how much these peak lists differ, the set of peaks identified by all eleven methods was also determined (core peaks).
Figure 4. Pair-wise comparison of shared peaks.
Pair-wise comparisons of the peak lists for (A) NRSF, (B) GABP and (C) FoxA1 were conducted to determine the number of peaks shared between each pair of methods. Each panel shows the percentage of total peaks from one method (column) that is shared with another method (row). Programs in rows and columns are sorted by increasing number of peaks, and entries are shaded by color gradient such that red represents the highest shared proportion and blue the lowest.
This raises the question: what is gained by calling more peaks? To address it, we began by examining the qPCR-validated true positive sites available for NRSF and GABP. The sensitivity of the methods was assessed by calculating the percentage of these true positives found by each program (Figure 5A,C). For NRSF, the sensitivity of the different methods is remarkably similar up to the 1800-peak mark, after which SISSRS, E-RANGE and QuEST are slightly less sensitive. After 2500 peaks, the rate at which validated sites are discovered plateaus, yielding little gain in verified sites from the tail of the remaining peak lists. Sole-Search and CisGenome, which identify only about 1800 peaks, missed several positive sites picked up by programs calling more peaks. GABP showed more divergence in sensitivity to qPCR-verified sites, with Sole-Search, CisGenome and SISSRS falling well below the other algorithms. One of the most notable differences in performance between the NRSF and GABP datasets came from the two methods in Kharchenko's spp package, wtd and mtc, which were less sensitive on the GABP dataset. This decreased sensitivity may be caused by the broader enrichment regions noted in this dataset (see Figures S6, S7 and S8 and further discussion in the “Spatial Resolution” section). Directional scoring methods are known to be less useful for identifying broad enrichment signals, such as histone modifications or RNA polymerase binding, due to blurring of the signal between the forward and reverse reads (Figure 1B).
Figure 5. Sensitivity assessment.
The percentage of qPCR-verified positives detected by different programs is shown as a function of the increasing number of ranked peaks examined for (A) the NRSF dataset and its 83 qPCR-verified sites, or (C) the GABP dataset and its 150 qPCR-verified GABP binding sites. qPCR sites were classified as “found” if the center of the site occurred within 250 bp of a program's predicted binding site (peak summit or peak region center). (B) Coverage of high confidence (FIMO p<1×10−7) NRSE2 motifs or (D) high confidence (FIMO p<1×10−6) GABP motifs throughout the human genome as a function of increasing ranked peaks examined. Motif occurrences were counted as covered if the center of the motif occurred within 250 bp of a program's predicted binding site (peak summit or center of peak region).
Though high in confidence, the qPCR gold standards cover only a handful of sites across the genome, perhaps limiting our ability to assess more subtle differences in sensitivity. To gain a more comprehensive picture, we conducted a whole-genome scan for the presence of high confidence canonical binding motifs. This approach, which permits an assessment of sensitivity against a larger database, generated a list of more than 3000 potential NRSF and 6500 potential GABP binding sites. The coverage of these motif occurrences largely recapitulates the patterns seen in the qPCR binding site analysis, suggesting that the similarities observed with the high confidence qPCR database are not simply artifacts of the small sample size (Figure 5B,D). In summary, the sensitivity of all methods on the NRSF dataset remains remarkably similar over most of the peak lists, while more noticeable differences emerge in the GABP data. The similarities in the NRSF data likely arise because many algorithms may have been tested and trained on this same dataset, thereby optimizing their default settings. The differences seen with GABP highlight the potential variability in performance and indicate that, for this dataset, directional scoring methods were less sensitive (SISSRS, mtc, wtd), corroborating the findings from our qPCR analysis.
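The sensitivity metric used in both analyses (the “found within 250 bp” rule stated in Figure 5) reduces to a short calculation. The sketch below is our own illustration with made-up coordinates, not data from the study:

```python
def sensitivity(predicted_summits, true_sites, max_dist=250):
    """Percentage of verified sites with a predicted binding position
    (summit or region center) within max_dist bp of the site center."""
    found = sum(1 for site in true_sites
                if any(abs(site - s) <= max_dist for s in predicted_summits))
    return 100.0 * found / len(true_sites)

# Two of four hypothetical verified sites have a summit within 250 bp
summits = [10_050, 55_200, 90_000]
qpcr_sites = [10_000, 55_000, 200_000, 300_000]  # centers of verified sites
sens = sensitivity(summits, qpcr_sites)
```

Plotting this value as a function of the number of ranked peaks examined yields the saturation curves of Figure 5, with the plateau marking the point where additional peaks recover few further verified sites.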
It is important to bear in mind, however, that high confidence motif sites represent only putative binding sites for the transcription factor. Some sites may not be occupied under the experimental conditions, and some may not even be present in the cell line's genome, given that cell lines are prone to genomic instability. Thus, while the co-occurrence of motif instances and detected peaks likely represents true binding sites, the failure to identify a peak at a motif site has several possible explanations.
Assessing the rate of false positives in the peak lists is a challenging task. The available set of qPCR-determined negative sites for NRSF provides only 30 “true negatives,” defined as sites where enrichment was less than 3-fold. By this standard, nine of the eleven programs called a total of two putative false positives (CisGenome and QuEST found none). The same two “true negative” sites (chr20:61280784–61280805 and chr6:108602345–108602365 in hg18) were identified by all nine programs. Although this could indicate some systematic bias in peak calling, Kharchenko et al. argue that, based on sequence tag distributions, these sites are likely bound by NRSF under the ChIP-seq experimental conditions (see Supplementary Fig. 9 from Kharchenko et al.). Thus, we find these “negative” sites and their counterparts in the GABP dataset unreliable for assessing the specificity of the different programs using metrics such as a receiver operating characteristic (ROC) curve, despite the fact that other groups have used this metric previously.
In the absence of an appropriate dataset for rigorous false positive testing, many investigators prefer to examine a stringent set of binding sites. Programs must therefore provide accurate means of ranking peaks according to some confidence metric. To assess peak ranking accuracy, we calculated the rate of canonical motif occurrence for NRSF, GABP and FoxA1 within additive intervals of 50 peaks (top 50, top 100, top 150, etc.; Figure 6 and Figures S1, S2). The percentage of peaks containing high confidence motifs decays with decreasing peak rank, suggesting that rank generally discriminates well between high and lower confidence peaks. The performance of the different ChIP-seq methods at detecting high confidence NRSF binding sites is very similar; the percentage of motif-containing peaks varied by less than 3%, with the exception of PeakSeq and HPeak. More variability is seen in the ranking of the top 50 peaks, though the methods still differ by only 10% when the outliers (PeakSeq and HPeak) are excluded. Over the first 2000 peaks, PeakSeq and HPeak detect 10 to 20% fewer peaks with strong motifs than the other algorithms. However, when a larger window (1 kb) surrounding the peak center is examined, the performance of these methods is comparable to the others (Figure S3). This result suggests that both PeakSeq and HPeak identify peaks with lower positional resolution than other methods for the NRSF dataset. The decay of motif content in ranked peaks for the other two datasets was similarly tightly clustered, showing relatively little variation with the exception of slightly poorer performance by Sole-Search in the GABP dataset and QuEST in the FoxA1 dataset (Figures S1 and S2, respectively). While changes in the significance threshold used to define a motif occurrence affected the absolute percentage of peaks containing motifs, they did not alter the performance of the programs relative to one another (Figure S5).
Another interesting point regarding peak ranking is that the different statistics provided by the same program can produce substantially different rankings, with variable success at identifying high-quality peaks (Figure S4).
Figure 6. Ranking accuracy.
Ranked peak lists were examined in increasing 50 peak intervals (50 peaks, 100 peaks, etc.). Peaks were deemed to contain a high confidence NRSE2 motif if a MAST search of the region surrounding the predicted binding site (peak summit or peak region center) yielded a motif within 500 bp (p<1×10−6) of the center. The percentage of peaks containing motifs was evaluated for each interval for all eleven methods.
This peak ranking analysis provides considerably more practical information to the user than the motif analysis conducted by Laajala et al., which simply reports the average significance of motif overlap with all peaks. Our results support their general conclusion that the full peak lists from all programs contain a significant proportion of the canonical binding motif, and they also demonstrate the importance of peak rank in recovering high confidence motif sites.
We note that the absence of a strong motif occurrence does not definitively classify a peak as a false positive, as some such peaks could represent true binding sites with weak or non-canonical binding motifs. Nonetheless, high confidence motif occurrences within peaks are a good indicator of an actual binding event and can be used to assess how well peak ranking identifies the most confident binding sites. Furthermore, previous studies of non-canonical motifs suggest that these sites make up a relatively minor fraction of overall motif occurrences.
Given the vagaries of ChIP enrichment, it is important to consider the robustness of peak calling specificity with “noisy” data. Less efficient ChIP enrichments produce datasets with a larger ratio of non-specific background sequence to ChIP-targeted sequence. Such datasets are characterized by higher background noise, lower peaks and under-sampling of low-intensity peaks. The complexity of features in the background sequence (discussed in the Introduction) makes modeling “noise” extremely challenging. We therefore simulated noisy datasets in silico by removing randomly sampled ChIP reads from Johnson et al.'s NRSF dataset and introducing an equal number of reads from the background data. Datasets were simulated in which the noisy ChIP sample was composed of 10%, 30% or 50% reads sampled from the background control dataset. These increasingly noisy datasets are meant to simulate decreasingly efficient ChIP enrichments at the same sequencing coverage.
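The read-swapping procedure described here can be sketched in a few lines. This is a schematic reconstruction with placeholder read records, not the scripts used in the study:

```python
import random

def simulate_noisy_chip(chip_reads, control_reads, noise_frac, seed=0):
    """Replace a fraction of ChIP reads with reads sampled from the
    control, keeping the total sequencing depth constant."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    n_noise = int(len(chip_reads) * noise_frac)
    kept = rng.sample(chip_reads, len(chip_reads) - n_noise)   # drop ChIP reads
    noise = rng.sample(control_reads, n_noise)                 # add background
    return kept + noise

chip = list(range(10_000))             # stand-ins for read records
control = list(range(50_000, 80_000))
noisy = simulate_noisy_chip(chip, control, 0.30)
# Depth is unchanged; 30% of reads now come from the background pool
```

Repeating this at noise fractions of 0.10, 0.30 and 0.50 yields the series of decreasingly efficient simulated enrichments analyzed here.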
As expected, the number of peaks called decreases in simulations of less efficient ChIP (Figure S6). The decrease tended to be most marked for programs that called larger peak lists, suggesting that it was the smaller peaks that were lost in the noise. This conclusion was borne out by searching for canonical motifs in the ranked peak lists from our simulated noisy data. Few differences were observed between the variable-noise datasets in the motif content of ranked peaks (Figure S7), indicating that though all programs lost some peaks in the noise, they tended not to make more spurious peak calls. QuEST showed the most notable decay of motif content in noisier datasets, likely because its background filtering method relies on larger control datasets. In noisier simulations, HPeak and PeakSeq showed increasing motif content in the top 500 peaks, suggesting that their ranking algorithms performed better on noisier datasets. Further investigation is needed to uncover the origin of this phenomenon, though we suspect it may reflect better spatial precision in their identifications. In summary, however, we find few substantial differences between the performance of these programs on our simulated datasets at increasing noise levels.
In addition to discriminating true binding sites, a ChIP-seq peak finder should identify each binding site with some degree of precision to facilitate locating the DNA-protein interaction. The width of identified peaks can be an important consideration for de novo motif searches of peaks identified by ChIP-seq, since extraneous sequence around the true protein binding site adds noise that can obscure the motif signal. Most programs report a peak region of variable width, given by start and stop coordinates. However, directional scoring methods tend to report either narrow fixed-width peaks (SISSRS) or single-coordinate peaks (spp package), rather than the wider regions reported by other methods. For both the FoxA1 and NRSF datasets, the median peak width was between 250 and 400 bp for all methods reporting peak width ranges, with the exception of CisGenome, which had smaller median peak widths (72 bp for NRSF and 90 bp for FoxA1; Figures S8 and S9). In contrast, peaks called from the GABP dataset tended to be wider, with median peak widths ranging from 300 to 800 bp, again excepting CisGenome at only 90 bp (Figure S10). This variance between datasets emerges either from actual differences in transcription factor binding (GABP binding in a more distributed manner), from variation in sample preparation (such as differences in antibody specificity or size selection during preparation of the sequencing library), or from a combination of such factors.
In general, programs also provide an estimate of the exact binding position, given as a single coordinate calculated either as the highest point of tag coverage in the peak or by some other scoring metric. This coordinate is meant to aid the researcher in homing in on the section of DNA originally cross-linked to the target protein during the ChIP enrichment step. Though there is no single nucleotide at which cross-linking occurs, this estimate is meant to facilitate the precise discovery of cis-regulatory elements. To assess the positional accuracy of the estimates made by different programs, the distance was calculated between each predicted binding coordinate and the centers of high confidence binding motifs within 250 bp (Figure 7, Table S3). Binding positions were estimated as the center of the reported peak region if the program did not provide a predicted binding coordinate (HPeak, PeakSeq and Sole-Search; starred in Figure 7). Unsurprisingly, all three datasets revealed that these centered estimates provided much less positional resolution than the precise predictions of binding positions made by other programs.
Figure 7. Positional accuracy and precision.
The distance between the predicted binding site and high confidence motif occurrences within 250 bp was calculated for different peak calling programs in the (A) NRSF, (B) FoxA1 and (C) GABP datasets. Negative distances indicate that the motif was found before the peak coordinate (e.g. a motif centered at chr1:1000 and a predicted binding site at chr1:1050 correspond to a distance of −50 bp). The variation in distances from predicted binding sites to motif centers is presented as a box-and-whisker plot for each program. Starred programs (*) did not provide a predicted binding coordinate, so binding positions were estimated as the center of the reported peak region. Exact numbers are available in Table S3.
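The signed-distance calculation underlying Figure 7 follows the sign convention stated in the caption (motif center minus predicted coordinate, so a motif 5′ of the prediction yields a negative value). A minimal sketch with hypothetical coordinates:

```python
def signed_distances(predicted_sites, motif_centers, max_dist=250):
    """Signed offset (motif center minus predicted coordinate) for each
    predicted site with a motif within max_dist bp; negative values mean
    the motif lies before (5' of) the predicted coordinate."""
    dists = []
    for p in predicted_sites:
        nearby = [m - p for m in motif_centers if abs(m - p) <= max_dist]
        if nearby:
            dists.append(min(nearby, key=abs))  # closest motif wins
    return dists

# Motif at 1000 vs prediction at 1050 gives -50 bp (the caption's example);
# a prediction at 9990 sits 10 bp upstream of a motif at 10000
preds = [1050, 5000, 9990]
motifs = [1000, 10_000]
d = signed_distances(preds, motifs)  # [-50, 10]; 5000 has no motif in range
```

Collecting these offsets per program and summarizing them as box-and-whisker plots reproduces the layout of Figure 7, where a consistent positive or negative median reveals a systematic shift.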
For all programs, positional accuracy was lower on the GABP dataset (Figure 7C) than on either FoxA1 or NRSF. Keeping in mind the wider peak regions called for the GABP dataset, we conclude that the signal from binding events in this dataset is likely broader, which makes precise estimation of the binding location more challenging. However, the same trends in each program's positional accuracy were observed across the three datasets, despite changes in absolute magnitude. The predictions of QuEST and HPeak were both consistently shifted downstream of the nearest high confidence motif occurrence (3′ direction, positive shift in Figure 7), indicating some unknown, systematic bias in these unrelated algorithms. MACS and the three directionality-dependent methods (SISSRS and Kharchenko's wtd and mtc programs) provided some of the best spatial resolution of binding events. The success of directional scoring methods follows logically from their search strategy which, unlike other methods, hinges on identifying a single “transition point” between the tag densities on the sense and antisense strands.
Selecting a peak detection algorithm is central to the design of ChIP-seq experimental studies. Though the algorithmic details may seem arcane to many biologists, computational analysis is the key to extracting meaningful biological information from sequence-based data. We demonstrate that eleven ChIP-seq analysis programs of varying algorithmic complexity identify protein binding sites from common empirical datasets with remarkably similar sensitivity and specificity. We also find few substantial differences between the performance of these programs on our simulated datasets at increasing noise levels. A more complete analysis of the origins of noise, and improved metrics for determining the noisiness of datasets, would certainly benefit future ChIP-seq experiments.
The programs differed most significantly in the spatial resolution of their estimates of the precise binding region. The best estimates of precise binding location were provided by Kharchenko et al.'s ChIP-seq processing pipeline (spp) , which uses directionality scoring, followed closely by Zhang et al.'s popular MACS program . These tools would be an excellent choice, especially for applications such as de novo motif discovery in regions with multiple motifs, where it is important to accurately minimize the sequence search space. We base our observations on the analysis of sequence data generated exclusively from transcription factor ChIPs. Since different physical factors inherently influence peak profiles from non-transcription factor ChIPs (e.g. RNA polymerase, epigenetic modifications), we expect algorithm performance to differ significantly for such datasets. Several algorithms have been written specifically to address this issue and should be chosen in lieu of those evaluated herein if non-transcription factors are being studied , , , , , , .
Given the similarities in performance, the implementation and general usability of the different programs is an important factor in choosing an analysis tool (Figure 2). Most programs are run from the command line and require variable degrees of data formatting and computational expertise to implement. Kharchenko et al.'s ChIP-seq processing pipeline (spp) is run as a package from within the statistical program R, which facilitates data visualization and downstream analysis for the statistically-inclined user. CisGenome  and Sole-Search  can be implemented with a graphical user interface (GUI), which is an important consideration for the bench-top biologist. CisGenome provides an integrated platform for ChIP-chip and ChIP-seq analysis, combined with downstream motif finding and an integrated genome browser; however, the CisGenome GUI is currently restricted to the Windows platform. Sole-Search runs a cross-platform compatible Java-based GUI that locally formats and compresses the data before uploading it to a web-server for remote analysis, a useful feature for users with limited computing resources and expertise.
An important consideration for ChIP-seq peak detection concerns the desired balance between sensitivity and specificity in compiling the final candidate peak list. Depending on the biological question, the user may want to examine either a stringent list of the most-confident peaks or a more comprehensive set of peaks that may include more false positives. It is crucial that this balance of stringency and sensitivity be tunable to the needs of the user. Changing various parameters in each program and re-running the analysis can adjust the number of peaks reported. Alternatively, the user can simply rank called peaks according to some peak statistic (such as number of tags, fold enrichment over background, or p-value) and analyze only the top n peaks, where n is adjusted according to the researcher's desired stringency. Relative to previous reviews of ChIP-seq algorithms , our analysis provides considerably more resolution throughout the peak lists (50 peak intervals) and offers a better glimpse at how peak “quality” declines with decreasing rank.
We have demonstrated that ChIP-seq peak callers need not be overly sophisticated in their algorithmic approach to achieve comparable performance identifying relatively stringent lists of binding sites. While our assessment suggests that improvements in peak calling specificity and sensitivity are possible, it seems clear that the field faces a conundrum. It is challenging to rigorously assess subtle improvements due to the scarcity and unreliability of verified binding sites for any ChIP-seq dataset. Furthermore, without adequate verification data for false positive testing, the decision of how many peaks to evaluate as putative binding sites remains a matter of biological intuition combined with trial and error, despite layers of statistical sophistication. Recent studies ,  suggest that using full biological replicates in ChIP-seq experiments may provide the most reliable manner of filtering false positives from true binding sites, a practice already encouraged by several groups such as the ENCODE consortium , . We suggest that rather than focus solely on algorithmic development, equal or better gains could be made through careful consideration of experimental design and further development of sample preparations to reduce noise in the datasets.
Raw sequencing reads for the NRSF dataset  (kindly provided by A. Mortazavi) and GABP dataset  (downloaded from the QuEST website, http://mendel.stanford.edu/SidowLab/downloads/quest/) were aligned to the human genome (NCBI Build 36.1) using Bowtie . The FoxA1 dataset  was downloaded as reads aligned to the human genome (NCBI Build 36.1) from the MACS website (http://liulab.dfci.harvard.edu/MACS/Sample.html). The datasets had the following numbers of uniquely mapped sequence reads: NRSF ChIP: 2,088,238 with 3,079,013 input control reads; GABP ChIP: 7,829,282 with 17,299,213 input control reads; FoxA1 ChIP: 3,909,805 with 5,233,683 input control reads.
Unless otherwise specified, all peak calling programs were run with default or recommended settings from a 2.66 GHz Intel Core i5 MacOSX desktop machine equipped with 4 GB of RAM. CisGenome GUI mode was tested on a virtualized instance of the Windows OS running from the aforementioned Mac. The Sole-Search program runs by default via submission to a web-server. Peaks with overlapping coordinates from different programs' peak lists were determined by pair-wise comparison using BEDTools .
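For illustration, the pair-wise overlap test performed with BEDTools can be sketched in Python; this is a naive stand-in, not BEDTools' actual algorithm, and the half-open (chrom, start, end) tuples are hypothetical:

```python
def overlaps(peak_a, peak_b):
    """True if two half-open intervals (chrom, start, end) overlap."""
    chrom_a, start_a, end_a = peak_a
    chrom_b, start_b, end_b = peak_b
    return chrom_a == chrom_b and start_a < end_b and start_b < end_a

def shared_peaks(list_a, list_b):
    """Peaks from list_a that overlap at least one peak in list_b.
    A naive O(n*m) scan; BEDTools sorts and sweeps for efficiency."""
    return [p for p in list_a if any(overlaps(p, q) for q in list_b)]
```

In practice a sorted-sweep or indexed approach (as in BEDTools) is needed for millions of intervals, but the overlap criterion itself is exactly this comparison of coordinates.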
Peak lists that were not ranked automatically by programs were sorted according to peak characteristics reported by each program (Supplemental Table S2). PeakSeq and CisGenome return ranked lists by default. The Minimal ChIP-seq Peak Finder peak list was sorted by the number of reads in the cluster, E-RANGE by fold enrichment and then by p-value, HPeak by the peak's maximum coverage, SISSRS by fold enrichment and then p-value, MACS by the −10*log10(p-value) and then by fold enrichment, the wtd and mtc methods from the spp package by the false discovery rate and then by the score, and Sole-Search by the peak's read count and then by the effect size. The regions in the QuEST peak list were sorted by q-value rank and only the most significant peak in each region was retained as QuEST's estimate of the exact binding site.
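The primary/secondary sorting described above amounts to a multi-key descending sort, where the secondary statistic breaks ties in the primary one. The sketch below uses hypothetical field names, not the actual columns reported by any of these programs:

```python
# Each peak is a dict; the keys are illustrative, not real program output columns.
peaks = [
    {"id": "p1", "neg_log10_p": 30.0, "fold": 12.0},
    {"id": "p2", "neg_log10_p": 45.0, "fold": 8.0},
    {"id": "p3", "neg_log10_p": 30.0, "fold": 20.0},
]

# Primary key: significance score; secondary key: fold enrichment (tie-break).
# Both sorted in descending order, so the most confident peaks come first.
ranked = sorted(peaks, key=lambda p: (p["neg_log10_p"], p["fold"]), reverse=True)

# Keep only the n most confident peaks for downstream analysis.
top_n = ranked[:2]
```

Because Python tuples compare element by element, the second key is consulted only when the first is equal, which is exactly the tie-breaking behavior described above.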
Positional Accuracy and Peak Motif Content.
All motif searching was conducted using programs from the MEME/MAST package  and the following instances of each TF's canonical binding motif: the well-defined NRSE2 motif  was used for NRSF, while the TRANSFAC  database motifs were used for GABP (M00341) and FoxA1 (M01261). An exact binding site prediction was available from all programs except PeakSeq, Sole-Search and HPeak (though HPeak 2.1 has this feature, this version was available only for the Linux OS at the time of writing). In the absence of a predicted binding site, the center of each peak region was substituted as an exact binding site prediction. Regions 250 base pairs upstream and downstream from the predicted binding site were searched using MAST  for high confidence hits of the canonical motif for the TF. Positional accuracy was assessed for the top 1500 peaks from each method as the distance from the predicted binding site to the center of the closest high confidence motif occurrence within 250 bp. The percentage of peaks containing at least one significant motif within 250 bp of the predicted binding site was calculated for additive 50 peak increments throughout each program's ranked list of peaks.
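The positional accuracy measurement can be sketched as follows; this is an illustrative re-implementation, not the scripts used in the study, and coordinates are single integers on one chromosome for simplicity. The sign convention matches Figure 7: a motif upstream of the predicted site yields a negative distance.

```python
def nearest_motif_distance(site, motif_centers, window=250):
    """Signed distance (bp) from a predicted binding site to the closest
    motif center within +/- window bp. Negative values mean the motif lies
    upstream (before) the predicted site. Returns None if no motif is
    found inside the window."""
    in_window = [m - site for m in motif_centers if abs(m - site) <= window]
    if not in_window:
        return None
    return min(in_window, key=abs)

def motif_content(distances):
    """Fraction of peaks with at least one motif inside the window,
    i.e. the 'percentage of peaks containing a motif' statistic."""
    hits = sum(1 for d in distances if d is not None)
    return hits / len(distances) if distances else 0.0
```

For example, a motif centered at 1000 with a predicted site at 1050 gives a distance of −50 bp, matching the convention used in the figure legend.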
Eighty-three qPCR-validated NRSF-positive sites were obtained from Mortazavi et al.  and 150 qPCR-validated GABP-positive sites were found in Collins et al. . A set of 3002 high confidence NRSE2 motif  occurrences in the human genome were identified by FIMO  search of human genome build NCBI Build 36.1, using a cutoff of p<1×10−7. For GABP, a set of 6670 motif occurrences in the human genome were identified by FIMO  search using a cutoff of p<1×10−6. The corresponding FIMO search for the FoxA1 motif returned >40,000 highly repetitive motif occurrences with only 2 distinct p-values. Because a subset of high confidence motifs could not be defined genome-wide, sensitivity analysis was not conducted for FoxA1. For NRSF and GABP, the number of high confidence motif occurrences found within peak regions was determined for 1-peak increments throughout each ranked peak list, using a combination of custom Perl scripts and BEDTools .
Survey of open-source ChIP-Seq analysis programs. References that also appear in the main text are numbered accordingly. Supplementary references are indicated by S1 (etc) and NA indicates that the program has not yet been published. Websites hosting the code are provided for each method, unless the code was not publicly released at time of writing (usually available on request from authors).
Methods used to rank peak lists from different programs. If programs returned a sorted peak list by default, no further sorting was conducted (NA). Secondary sorting method was used to break ties following the primary sorting.
Median and standard deviation of positional accuracy data. Median and standard deviation of the distance from estimated binding sites to the nearest high confidence motif occurrence, measured in base pairs. Measurements conducted for the top 1500 peaks in each peak list. Represented graphically in Figure 7 of the main text.
GABP ranking accuracy. Ranked peak lists were examined in increasing 50 peak intervals (50 peaks, 100 peaks, etc.). Peaks were deemed to contain a high confidence GABP motif if a MAST search of the region surrounding the predicted binding site (peak summit or peak region center) yielded a motif within 500 bp (p<1×10−4) of the center. The percentage of peaks containing motifs was evaluated for each interval for all eleven methods.
FoxA1 ranking accuracy. Ranked peak lists were examined in increasing 50 peak intervals (50 peaks, 100 peaks, etc.). Peaks were deemed to contain a high confidence FoxA1 motif if a MAST search of the region surrounding the predicted binding site (peak summit or peak region center) yielded a motif within 500 bp (p<1×10−4) of the center. The percentage of peaks containing motifs was evaluated for each interval for all eleven methods.
NRSF Ranking accuracy revisited (1 kb regions). Ranked peak lists were examined in increasing 50 peak intervals (50 peaks, 100 peaks, etc.). Peaks were deemed to contain a high confidence NRSE2 motif if a MAST search of the region surrounding the predicted binding site (peak summit or peak region center) yielded a motif within 1 kb (p<1×10−6) of the center. The percentage of peaks containing motifs was evaluated for each interval for all eleven methods for the top 2000 peaks.
Different confidence metrics yield different rankings. Peak confidence measures provided by the same program can produce quite different rankings with different proportions of high confidence motifs. Ranking of MACS peak list by three different confidence measures (1st in figure legend indicates the primary means of sorting, the 2nd measure is used to break any ties). Analysis as in Figures S1–S3.
Motif stringency thresholds. Using either A) less stringent (p<1×10−5) or B) more stringent (p<1×10−8) thresholds for defining “significant” NRSE2 motifs found by MAST search within 500 bp of the peak did not change the relative ranking of the eleven tested methods. Compare with main text Figure 6.
Peaks called from simulated datasets. A) Number of peaks called from normal and simulated datasets at different noise levels. B) The percent decrease in the number of peaks called by each program was calculated as the difference between the normal and simulated datasets divided by the size of the normal dataset.
Motif content in ranked peaks from simulated noisy datasets. Panels show the change in motif content throughout the peak lists in Johnson et al.'s unperturbed ChIP sample and with 10–50% noise introduction from background sequence for each program.
Variation in width of peak regions reported by different ChIP-Seq peak callers for the NRSF dataset. The width of each peak was calculated as the difference between start and stop coordinates. Continuous density plots were generated to display the frequency with which different peak widths were observed in the lists reported by different peak calling programs. SISSRS and the spp package programs (mtc and wtd) were not included, as these methods report fixed-width regions or a single coordinate, respectively.
ChIP-seq has become a widely adopted genomic assay in recent years to determine binding sites for transcription factors or enrichments for specific histone modifications. Besides the detection of enriched or bound regions, an important question is to determine differences between conditions. While this is a common analysis for gene expression, for which a large number of computational approaches have been validated, the same question for ChIP-seq is particularly challenging owing to the complexity of ChIP-seq data in terms of noisiness and variability. Many different tools have been developed and published in recent years. However, a comprehensive comparison and review of these tools is still missing. Here, we have reviewed 14 tools developed to determine differential enrichment between two conditions. They differ in their algorithmic setups and in their range of applicability. Hence, we have benchmarked these tools on real data sets for transcription factors and histone modifications, as well as on simulated data sets, to quantitatively evaluate their performance. Overall, there is great variety in the type of signal detected by these tools, with a surprisingly low level of agreement. Depending on the type of analysis performed, the choice of method will crucially impact the outcome.
ChIP-seq, differential analysis, software
High-throughput sequencing (HTS) has become a standard method in genomics research and has almost completely superseded array-based technologies, owing to ever-decreasing costs and the variety of different assays that are based on short read sequencing. Most array-based assays now have a counterpart based on HTS, with a generally improved dynamic range in the signal. Genome sequence (whole genome or exome), DNA methylation (whole genome bisulfite sequencing), gene expression (RNA-seq, CAGE-seq), chromatin accessibility (DNase1-seq, ATAC-seq, FAIRE-seq) and protein–DNA interaction (ChIP-seq) all belong to the standard repertoire of genomic studies, and follow standardized protocols. However, the broad availability of these approaches should not hide the fact that they are still highly complex, requiring a number of experimental steps that can lead to considerable differences in the readout for the same assay performed by different groups [1, 2]. Large-scale consortia such as ENCODE or Roadmap Epigenomics, which rely on different sequencing centers for data collection, have faced the problem of harmonizing the results obtained by different centers, which requires systematic bias correction before data integration can be achieved in a meaningful way. Clearly, the more complex the experimental setup, the more it is subject to biases, which can be introduced in the different steps of the experimental protocol or the downstream analysis [3, 4]. Among the approaches listed previously, those based on immunoprecipitation are the most complex, as the antibody-based precipitation usually represents a critical step and leads to variations in precipitation efficiency and cross-reaction probability, conditioned by the quality of the antibody. Hence, the reproducibility of the assays is often limited, especially in cases with additional constraints, for example low input material.
The amount of noise in the data can be substantial: a standard measure of the signal-to-noise ratio is the FRiP (fraction of reads in peaks), which measures how many sequencing reads are located in enriched regions, compared with the total number of reads . In the ENCODE project, this ratio was in the range of a few percent, indicating that the amount of noise is >90%. Special statistical methods, such as the Irreproducible Discovery Rate , have been developed to detect consistent signal between replicates.
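As a sketch, the FRiP statistic can be computed as follows under a simplifying assumption: each read is reduced to a single (chrom, position) point rather than an aligned interval, and peaks are hypothetical half-open (chrom, start, end) tuples:

```python
def frip(read_positions, peaks):
    """Fraction of Reads in Peaks: share of reads whose (chrom, pos)
    point falls inside any enriched region (chrom, start, end).
    A linear scan over peaks per read; real pipelines use interval
    indexes over millions of reads."""
    def in_peak(chrom, pos):
        return any(c == chrom and s <= pos < e for c, s, e in peaks)
    if not read_positions:
        return 0.0
    hits = sum(1 for chrom, pos in read_positions if in_peak(chrom, pos))
    return hits / len(read_positions)
```

A FRiP of a few percent, as reported for ENCODE data sets, means the overwhelming majority of reads fall outside called peaks, i.e. in background.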
Detecting differential gene expression between several conditions is one of the most common analysis steps since the advent of genome-wide expression measurements based on microarrays, and a considerable literature has been dedicated to the development of solid statistical procedures. Expression measurement based on HTS has also led to the development of new tools and the adjustment of existing procedures. Differential analyses are, however, not restricted to gene expression but can in principle be extended to any quantitative assay, such as the measurement of DNA methylation levels or the enrichment of ChIP signal. Accordingly, many tools are available either to detect differential gene expression or to delimit differentially methylated regions (DMRs). Similarly, a number of tools and methods have been published to detect differences in ChIP signal between several conditions [6–19]. While the underlying question is similar for ChIP-seq experiments (where are the regions showing significant differential signal between two conditions?), there are a number of particularities that make this simple question particularly challenging for ChIP-seq data sets, as compared with RNA-seq or whole-genome bisulfite sequencing. In the case of RNA-seq, most of the signal is concentrated in regions that are either annotated as genes or easily recognized as enriched regions representing unannotated transcripts. Hence, the search space for differential signal is well defined. In the case of differential DNA methylation, the search space is extended to the complete genome, as CpGs occur virtually everywhere. However, the signal range is constrained to a finite interval (between 0% and 100% methylation) and the amount of noise is low, two properties which make the search for DMRs a tractable problem.
In the case of ChIP-seq, we are facing multiple challenges: (i) the search space is not limited to a particular region of the genome, as differential binding can occur everywhere; this implies that regions of interest, in which differential signal should be looked for, need to be defined first; (ii) the range of the signal is not constrained to a finite interval and requires transformation such that standard statistical tools can be applied; (iii) the amount of noise is considerable, making variations in the signal challenging to detect, especially when these differences are subtle; and (iv) the properties of the enriched regions (in particular their length) differ substantially depending on the protein or epigenetic modification targeted by the immunoprecipitation.
In this article, we have performed a comprehensive comparison of a large number of tools published to detect differential signal in ChIP-seq assays. Our criterion for tool selection was the availability of working software that could be implemented without extensive efforts for porting the code. These tools differ in many ways, and this diversity reflects the diverse challenges listed previously: some require preliminary detection of enriched regions by external peak-calling algorithms, while others implement their own detection method or work using sliding windows; they also differ in the underlying statistical modeling of the signal distribution, based either on a Poisson distribution or on a more flexible negative binomial distribution. Finally, some tools work in the absence of replicates for each condition, while others require replicates to provide differential analysis. Importantly, some tools have been specifically designed for particular ChIP-seq data, such as histone modifications or transcription factor (TF) binding. We have indicated this in the tool description (Supplementary Table S1); however, we have tried to apply these tools to all types of data sets to verify their behavior. For the evaluation, and given the absence of a gold standard for differential enrichment in ChIP signal, we have adopted a multistep strategy. First, the tools have been compared using various published ChIP-seq data sets, to evaluate global statistics on the sets of differential regions (DR) detected by each tool, such as the number of DR, their length distribution and the pairwise overlap between the outputs. We have made the distinction between tools that do not require replicates for the definition of DR (single-replicate) and those that only work when replicates are available (multi-replicates), and performed all analyses for both categories separately.
To estimate rates of true positives and false positives, we have performed a simulation of synthetic data sets, making sure that these simulated data sets are as close as possible to real data sets and represent realistic and fair test sets. Using these data sets, we have compared the sensitivity and specificity of each tool. In particular, we have investigated the threshold in differential signal that is required for each tool to start detecting a significant difference. Finally, we have conducted a functional annotation of the sets of DR and related the DR to differentially expressed genes (DEG). This last analysis is meant to address typical biological questions that are mostly focused on differences in gene expression driven by differential enrichment.
Overall, we have seen considerable differences between the tools, which can be traced back to the underlying statistical assumptions or to the way the initial search space is defined, either based on peak calling or using a windowing approach. Researchers who are interested in finding DR in their study should be aware of these differences and should adapt the choice of tool to their particular experimental question. We believe that our extensive study can help in making a motivated choice and avoid obvious pitfalls.
Materials and methods
We have selected 14 available tools for differential peak calling, based on a variety of algorithmic approaches [6–19]. A detailed description of each tool is given in Supplementary Text 1, whereas the parameters used are described in Supplementary Table S1.
Data set sources
A summary of the data sets used and statistics on the number of reads is given in Supplementary Table S2.
Transcription factor data
We obtained FoxA1 ChIP-seq data from a previously published study for two experimental conditions including estradiol (E2)- and vehicle (Veh)-treated MCF7 cells each in two biological replicates (GEO: GSE59530). Additionally, expression data as already aligned reads for both conditions derived from global run-on sequencing (GRO-seq) were taken from this study (GEO:GSE59531) .
Sharp histone post-translational modification (PTM) data
Two biological replicates of H3K27ac ChIP-seq were retrieved from a study that differentiates embryonic stem cells (hESC-H1) to mesenchymal stem cells (GEO: GSE16256) .
Broad histone PTM data
Moreover, we collected data sets in two biological replicates for H3K36me3 in an MMSET (multiple myeloma SET domain) overexpression (TKO) against physiological expression (NTKO) setup in myeloma cells (GEO: GSE57632) .
Simulated sharp signal data
As an example of data with localized signal giving rise to sharp peaks, we used the merged replicates of the previously described FoxA1 ChIP in E2-treated MCF7 cells as a source for our simulation of a sharp signal data set (GEO: GSE59530) .
Simulated broad signal data
Moreover, two biological replicates of H3K36me3 in MCF7 cells were downloaded and pooled, which was used as source for the simulation of a broad signal data set (GEO: GSE31755) .
Input data for background simulation
As a control data set, we used non-precipitated DNA (‘input DNA'), which represents the standard control in ChIP-seq experiments. Five input experiments from different MCF7 studies were collected and further used to generate an additive background model (GEO: GSM1059389, GSM1089816, GSM1143667, GSM1241757 and GSM1383864). We performed a genome-wide correlation of the input signals to verify their similarity and pooled them together as the source for our background signal simulation. Data simulation was then performed as described in the section ‘Simulation of a differential data set'.
ChIP-seq preprocessing and analysis
The data sets were downloaded as sra files from the Sequence Read Archive (SRA) and converted to fastq with the SRA Toolkit. Reads were aligned to the hg19 reference genome using BWA aln with default settings . Subsequently, duplicated reads and reads with bad mapping quality (-q 1) were removed from alignments.
We performed peak calling on all ChIP-seq data sets using MACS2 with additional parameters chosen according to the underlying signal type as follows: for sharp peaks ‘-g hs -q 0.1 --call-summits' and for broad peaks ‘-g hs -q 0.1 --broad' . The resulting peak sets were used as input for tools that require peaks/regions as input for differential enrichment analysis.
Differential peak calling
We performed differential peak calling with each tool according to the settings recommended by the developers of the tools or, if not indicated explicitly, with default settings as listed in Supplementary Table S1.
Moreover, tools were classified into two categories according to their ability to use biological replicates or only single samples (Supplementary Table S1).
For the tools that do not consider replicates, the replicates of each ChIP-seq experiment were pooled to create a single data set for each condition. Because each of these tools calculates different significance measures, a general significance threshold for all of them could not be defined (Supplementary Table S1). However, for tools using a false-discovery rate (FDR) or P-value measure, we decided to set the thresholds as FDR ≤ 0.01 or P-value ≤ 0.05.
We obtained already aligned GRO-seq reads for the FoxA1 data set as input for groHMM, an R pipeline for the analysis of GRO-seq data . Differential GRO-seq analysis was performed according to the groHMM manual using edgeR with a significance threshold of P ≤ 0.05 [25, 26]. The resulting list of DEGs between E2- and Veh-treated MCF7 cells was used for further integration with the differential peak data sets.
DEG enrichment analysis
Lists of DEGs for the H3K27ac and H3K36me3 data sets were taken from the corresponding studies to integrate them with computed differential peak sets. For FoxA1, we used the results of the GRO-seq analysis described above. Additionally, a list of house-keeping genes (HKGs) was obtained (http://www.wikicell.org/index.php/HK_Gene) and used as a control to compare enrichment of DEGs with HKGs.
In addition, we introduced for each ChIP-seq data set another control experiment: we ranked all genes using the coverage fold-change within gene bodies (H3K36me3) and promoters (FoxA1 / H3K27Ac). We used these rankings to perform the same recovery analysis.
Gene ontology enrichment analysis
As large variations in the number of differential peaks were observed for all data sets, we decided to use a ‘gene centric' approach, focusing on the top 1000 genes closest to a differential region. Therefore, each peak was annotated with its nearest gene using the R package ChIPseeker . We ranked these genes according to the corresponding peak significance measures and obtained the top 1000 unique genes for each differential peak set. Regions ±1.5 kb around the consensus transcription start site (TSS) were used as input for the annotation tool GREAT with the setting ‘single nearest gene' .
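The gene-centric ranking can be sketched as follows; this is a simplified stand-in for the ChIPseeker annotation step, and the gene names, TSS coordinates and peak centers are invented for illustration:

```python
def nearest_gene(peak_center, tss_by_gene):
    """Assign a peak to the gene with the closest TSS (coordinates on a
    single chromosome for simplicity)."""
    return min(tss_by_gene, key=lambda g: abs(tss_by_gene[g] - peak_center))

def top_unique_genes(ranked_peak_centers, tss_by_gene, n=1000):
    """Walk peaks in decreasing significance order and keep the first
    occurrence of each gene, until n unique genes are collected."""
    seen, genes = set(), []
    for center in ranked_peak_centers:
        g = nearest_gene(center, tss_by_gene)
        if g not in seen:
            seen.add(g)
            genes.append(g)
        if len(genes) == n:
            break
    return genes
```

The deduplication step matters because several differential peaks can map to the same gene; ranking by peak significance ensures each gene enters the list at its most significant peak.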
As the transcription factors or histone modifications studied here are considered to be positively correlated with gene expression, we used DEGs as a positive control. Therefore, regions ±1.5 kb around the consensus TSS of these genes were used as an input for a GREAT analysis with the setting ‘single nearest gene'.
Differential peak calling data presentation
Signal tracks were generated for each replicate using deepTools . For sharp signal data, scaling factors were estimated with the signal extraction scaling (SES) option, and further log2 ratios of ChIP over input signal were computed . Broad signal data were scaled according to the total number of reads and again log2 ratios were calculated.
Additionally, all differential peaks can be displayed in a single genome-wide coverage track, which was generated using the bedtools genomecov function .
All differential peak sets, signal tracks and genome-wide coverage tracks can be displayed using UCSC genome browser track hubs.
Simulation of a differential data set
To simulate a realistic ‘gold standard' data set for differential ChIP enrichment, we started from real ChIP data sets: one data set targeting FoxA1, as an example of a data set with sharp signal, and a H3K36me3 data set as an example of broad enrichment. In each case (FoxA1 and H3K36me3), we simulated two ‘treatment' data sets (T1 and T2) representing the reference condition and the comparison condition, as well as ‘control' data sets (C1 and C2), corresponding to input. The treatment data sets (T) were obtained by merging regions corresponding to real signal (S) with background noise (B) (hence, T = S + B). The signal data set was obtained as follows: we performed peak calling using MACS2, using the provided input data set as a control. We selected the top 20,000 peaks identified by the peak caller, as they are most likely to represent truly enriched regions. In a first step, only the reads located in these regions were considered; they represent the reference signal data set (S1). To simulate the second condition, these peak regions were split into two groups: 10,000 peaks were kept as is (S2(0)), representing nondifferential peaks. The remaining 10,000 peaks were divided into 10 groups of 1000 peaks each, which were downsampled by random read sampling by 10%, 20%, etc. (S2(10), S2(20), …). These represent sets of differential peaks in which the level of differential enrichment is variable. Next, we simulated realistic background signal B: several public input data sets for the MCF7 cell line were downloaded, aligned and compared. The input data set showing the highest correlation with the original input was used to sample reads, until the library size of the real ChIP-seq experiment was reached. If not enough reads were available in one input, it was merged with a second similar one. This was done twice, to simulate a background for the reference and the second condition (B1 and B2).
In summary, our two data sets are composed of T1 = S1 + B1 and T2 = S2 + B2, where S2 = S2(0) + Σ_{i=10,20,…,100} S2(i). Comparing these two data sets should ideally yield 10,000 differential peaks, which represent our gold standard.
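The downsampling used to create the differential peak groups S2(10), S2(20), … can be sketched as follows; this is an illustrative toy operating on a plain list of reads rather than alignment files:

```python
import random

def downsample(reads, percent, seed=0):
    """Randomly drop `percent`% of the reads in a peak region, keeping the
    rest. A fixed seed makes the simulation reproducible; the actual
    pipeline sampled reads from aligned data, not a toy list."""
    rng = random.Random(seed)
    keep = round(len(reads) * (100 - percent) / 100)
    return rng.sample(reads, keep)
```

Applying this with percent = 10, 20, …, 100 to successive groups of 1000 peaks produces the graded levels of differential enrichment described above, with the 100% group losing all of its signal reads.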
The Bioconductor packages IRanges and GenomicRanges  were used to detect overlapping peak regions. For the simulated data set, a GenomicRanges object comprising all true differential binding events (n = 10,000) was implemented as a reference to find overlaps with DR called by the respective tools. We considered a called DR to be a true event if at least 25% of its length overlapped with a real DR.
Using this data set, we computed two different measures to infer sensitivity and specificity of the analyzed tools. For differential ChIP-seq analysis, the number of negatives outnumbers by far the number of positives, as most of the genome is not differentially enriched. Hence, the specificity is not a good measure, as all tools will have artificially high specificity, given the large number of true negatives. We therefore prefer using a ‘precision'-like measure. To take into account the fact that some tools call large DR, we defined the ‘Jaccard index precision' as J = (1/N) Σ_i |A_i ∩ B_j(i)| / |A_i ∪ B_j(i)|, a more stringent precision measure penalizing imprecise peak calling (either over- or under-calling). Here, A_i represents a DR called by a tool and B_j(i) represents the true DR that intersects A_i. The norm |·| is meant as the genomic length of the intersection or the union, and N indicates the number of DR called by the tool considered. We define the recall as the proportion of true DR that intersect a called DR with a 25% overlap.
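These two measures can be sketched as follows; this is an illustrative re-implementation on single-chromosome (start, end) intervals, and the recall here applies the 25% criterion to the true region's length, one possible reading of the definition above:

```python
def jaccard(a, b):
    """Jaccard index of two half-open intervals (start, end):
    length of intersection over length of union."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else 0.0

def jaccard_precision(called, truth):
    """Mean Jaccard index between each called DR and the best-matching
    true DR (0 when it overlaps none), penalizing both over- and
    under-calling of region boundaries."""
    scores = [max((jaccard(a, b) for b in truth), default=0.0) for a in called]
    return sum(scores) / len(scores) if scores else 0.0

def recall(called, truth, min_frac=0.25):
    """Fraction of true DR overlapped by some called DR over at least
    min_frac of the true region's length."""
    def covered(b):
        need = min_frac * (b[1] - b[0])
        return any(min(a[1], b[1]) - max(a[0], b[0]) >= need for a in called)
    return sum(1 for b in truth if covered(b)) / len(truth) if truth else 0.0
```

For instance, a called region (0, 100) against a true region (50, 150) intersects over 50 bp and unions over 150 bp, giving a Jaccard index of 1/3 even though the true region is recalled.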
Comparison using real data sets
We applied the single- and multi-replicate pipelines to selected data sets, using the default or recommended settings for the tools (see Supplementary Table S1 for methods and parameters used). As mentioned in the Introduction, a major distinction can be made between tools that require a predefined set of regions of interest (for example, peaks) and those that implement an internal procedure to define these regions of interest. We call the first set of tools EROI (for external regions of interest) and the second IROI (for internal regions of interest), and distinguish these two groups (in addition to the requirement for replicates or not) in the subsequent analysis.
Our first question was whether the tools would report consistent sets of differential peaks.
For both types of ChIP-seq data sets (sharp TF binding and broad histone modification enrichment), we observed considerable differences in the number of reported DR (Figure 1A and B). The sets of DR ranged from no detected differential peak (QChIPat) to about 150,000 DR for ODIN-poisson on the FoxA1 data set. Five of the methods reported a relatively consistent number of DR, in the range of 25,000–35,000 peaks. Among the tools that were given the same set of predetermined regions from an external peak caller (MACS2 in this case), the number of DR does not show particular consistency, with MAnorm reporting about 23,000 DR and QChIPat only a handful, indicating that the number of input regions is not the primary determinant of the number of detected DR. Obviously, the number of DR depends on the choice of parameters, which we made in a consistent way across tools. We were therefore interested in determining how sensitive the number of peaks is to this choice. Hence, we varied the threshold on the P-value or FDR and recorded the number of DR (Figures 1C and 2C). Surprisingly, the IROI tools from the single-replicate pipeline show a step-like behavior, with a pronounced jump at small P-values and a quasi-saturation beyond this threshold. Other tools show a more progressive increase. Hence, the choice of P-value threshold is crucial for the first class of tools, as the number of DR can vary abruptly by several orders of magnitude. As for the tools with a small number of DR (MMDiff, DBChIP), this number indeed increases when the threshold is relaxed, but remains much lower than for other tools of the same class, indicating an intrinsic difference in the internal statistical procedure used to call DR.
Overview of the DR called by the single and multi-replicate pipelines for the FoxA1 data set: (A) Number (barplots) and size distribution (violin plots) of detected DR in each condition and size distribution for the single replicate pipeline. (B) Same as shown in A for the multi-replicate pipeline. (C) Plots showing the number of DR returned by each method as a function of the P-value threshold (plain lines) or FDR (dotted lines). (D) An example of a region, highlighting the difference between the different tools in terms of length of DR. A colour version of this figure is available online at BIB online: https://academic.oup.com/bib.