This tool attempts to help determine the relationship of one set of genomic features to another, by executing a set of permutation tests, and by producing a set of graphs to help visualize the results. The basic idea of the permutation testing is to randomly change the location of the regions of interest in the genome, counting the number of overlaps with the feature of interest, to determine the distribution of overlap counts you would expect to see between the features and the regions by chance. This tool does permutation tests on three different, but related, measures: the number of regions that contain a feature; the number of features that overlap a region; and the number of base pairs of overlap between regions and features.
To test the relationship of regions A to features B using 10000 permutations run on 4 cores, run the following command:
python overlap_bps_with_feature.py A.bed "Region A" B.bed "Feature B" genome.fa.fai --permutations 1000 --cores 4
The results will be placed in the directory A_to_B
and contain a file results.txt
that has the results of the
permutation testing:
REGION_HITS REAL Region A Feature B 46
REGION_HITS QUANTILE Region A Feature B 0.823
FEATURE_HITS REAL Region A Feature B 110
FEATURE_HITS QUANTILE Region A Feature B 0.999
OVERLAP REAL Region A Feature B 753149
OVERLAP QUANTILE Region A Feature B 0.933
The lines marked "REAL" have the actual observed measures for the feature and region sets. The lines marked "QUANTILE" show the quantile of the real observed value in the distribution measured by the permutations; this can be used to estimate a p-value for enrichment (or un-enrichment, if it is low).
There will also be a pdf file, A_to_B.pdf
, which contains plots of the permutation
distributions of each of the three measures described above, as well as a series of
plots of the three measures when the regions are shifted fixed amounts to the left and
to the right of their current locations. You can see an example plot here.
The following programs need to be installed and in the path of the user executing the script:
- BEDTools https://code.google.com/p/bedtools/ (tested with 2.17.0)
- BEDOps https://code.google.com/p/bedops/ (tested with 2.10)
- R (Specifically Rscript) (tested with 2.14)
The Python script requires the argparse and multiprocessing libraries The R script requires the ggplot2 and grid libraries
The script requires three input files:
- Region BED file
- Feature BED file
These are BED files with the locations of your regions and features. For proper results, these should be sorted and merged (with bedtools mergeBed) ahead of time.
- Genome index file
This is a list of the chromosome names and their lengths in the genome of interest,
of the type generated by samtools faidx
.
Optionally you can also use a gaps file. This is a BED file with the locations of gaps in the genome reference, or other regions you want to be excluded from the possible locations used in the location permutations.
Usage for the script is:
usage: overlap_bps_with_feature.py [-h] [--gaps GAPS]
[--permutations PERMUTATIONS]
[--cores CORES] [--replot] [--shift_only]
[--process_iteration_chunk PROCESS_ITERATION_CHUNK]
[--iteration_number ITERATION_NUMBER]
[--iteration_chunk_size ITERATION_CHUNK_SIZE]
[--use_condor] [--verbose]
bp_file bp_name feature_file feature_name
genome
The options are:
gaps
: A BED file containing regions of the genome to exclude from location permutationpermutations
: The number of random location permutations to computecores
: The number of simultaneous processes to launch to run permutationsreplot
: if this option is used, the script will not do any new testing but will use the data files stored in the results directory to redraw the plotsshift_only
: skip the permutation testing and only generate the shift figuresuse_condor
: execute the permutations by spawning new tasks to be run on a compute grid using the condor scheduling engineiteration_chunk_size
: number of permutations to compute in a single process (only useful if you are playing with distrbuting processes on condor)verbose
: print extra debug/logging informationprocess_iteration_chunk
/iteration_number
/iteration_chunk_size
: do not use; these parameters are used when distributing tasks to condor
There are a variety of other techniques for computing the significance of enrichment between a set of regions and a set of features. Most of them are more sophisticated than this script. Here are a few to investigate:
- Statistics in BEDtools (with supposedly many more in the forthcoming BEDTools2)
- GenometriCorr
- The Genomic Hyperbrowser
- The ENCODE GSC tool
- EpiExplorer