Home
Welcome to the 2016_project_5 wiki!
Project name: "A framework to evaluate profiles from DNA-binding site collections represented in peak sequences from ChIP-Seq assays."
Our challenge here is to present a software framework that lets users evaluate how well a given motif (pattern) is represented in a set of DNA sequences. These sequences are "peaks" from a ChIP-seq assay. In general, a peak marks the approximate location (site) where a transcription factor (TF), which is a type of protein, binds to regulate transcription of a gene. These locations are known as Transcription Factor Binding Sites.
On one side, we have repositories of ChIP-seq assays. In these you can expect to find:
- The peaks, encoded as a list of tab-separated values in a .bed file. Each line of the .bed file gives the chromosome, start location and end location of a peak.
- The target of the experiment, representing the TF of interest.
- Additional metadata, such as the tissue/cell line used in the assay, the organism, the experiment authors etc.
#CHROMOSOME START STOP NAME SCORE STRAND
chr1 164404 173864 ENST00000466557.1 0 -
chr1 235855 267253 ENST00000424587.1 0 -
chr1 317810 328455 ENST00000426316.1 0 +
See more at: https://genome.ucsc.edu/FAQ/FAQformat.html#format1
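As a rough illustration of reading this format, here is a minimal sketch of a BED reader. The names `Peak` and `parse_bed` are hypothetical and not part of the project. It assumes the six columns seen in the example rows (chromosome, start, end, name, score, strand), of which only the first three are required.

```python
from typing import Iterable, Iterator, NamedTuple


class Peak(NamedTuple):
    """One BED record (hypothetical helper type, not part of the project)."""
    chrom: str
    start: int  # 0-based, inclusive (BED convention)
    end: int    # 0-based, exclusive (BED convention)
    name: str = "."
    score: str = "0"
    strand: str = "."


def parse_bed(lines: Iterable[str]) -> Iterator[Peak]:
    """Yield one Peak per data line, skipping comments, headers and blanks."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        # BED is tab-separated; split() with no argument also accepts spaces.
        fields = line.split()
        if len(fields) < 3:
            raise ValueError(f"BED line needs at least 3 fields: {line!r}")
        # Columns 4 to 6 (name, score, strand) are optional.
        yield Peak(fields[0], int(fields[1]), int(fields[2]), *fields[3:6])


example = """\
# comment lines are skipped
chr1 164404 173864 ENST00000466557.1 0 -
chr1 235855 267253 ENST00000424587.1 0 -
chr1 317810 328455 ENST00000426316.1 0 +
"""
peaks = list(parse_bed(example.splitlines()))
for peak in peaks:
    print(peak.chrom, peak.start, peak.end, peak.strand, peak.end - peak.start)
```

Because BED coordinates are 0-based and half-open, `end - start` gives the peak length directly.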
There are also motif databases that we can use to scan the sequences covered by the .bed file entries. These motifs are represented by a position-weight matrix (PWM), a matrix that gives, for each position of the motif, a weight for each nucleotide occurring at that position. Since DNA has an alphabet of 4 (A, C, G, T), a PWM is a matrix of size 4 x N, where N is the motif length.
Position frequency matrix (PFM):
A [ 87 167 281 56 8 744 40 107 851 5 333 54 12 56 104 372 82 117 402 ]
C [291 145 49 800 903 13 528 433 11 0 3 12 0 8 733 13 482 322 181 ]
G [ 76 414 449 21 0 65 334 48 32 903 566 504 890 775 5 507 307 73 266 ]
T [459 187 134 36 2 91 11 324 18 3 9 341 8 71 67 17 37 396 59 ]
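The PWM is derived from the PFM. Normalizing each column of counts so that it sums to 1.0 gives the probability of each nucleotide at each position; when these probabilities are then converted to log-odds scores against a background frequency, the resulting weights can be negative.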
Source: http://jaspar.genereg.net/cgi-bin/jaspar_db.pl?ID=MA0139.1&rm=present&collection=CORE
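The conversion can be sketched as follows, using the counts from the matrix above. The function names, the pseudocount of 1.0 and the uniform background of 0.25 are assumptions chosen for illustration; the project may settle on different values.

```python
import math

# Counts copied from the position frequency matrix above (JASPAR MA0139.1).
PFM = {
    "A": [87, 167, 281, 56, 8, 744, 40, 107, 851, 5, 333, 54, 12, 56, 104, 372, 82, 117, 402],
    "C": [291, 145, 49, 800, 903, 13, 528, 433, 11, 0, 3, 12, 0, 8, 733, 13, 482, 322, 181],
    "G": [76, 414, 449, 21, 0, 65, 334, 48, 32, 903, 566, 504, 890, 775, 5, 507, 307, 73, 266],
    "T": [459, 187, 134, 36, 2, 91, 11, 324, 18, 3, 9, 341, 8, 71, 67, 17, 37, 396, 59],
}


def pfm_to_ppm(pfm, pseudocount=0.0):
    """Normalize each column of counts so that it sums to 1.0."""
    length = len(next(iter(pfm.values())))
    ppm = {base: [] for base in pfm}
    for i in range(length):
        total = sum(pfm[base][i] + pseudocount for base in pfm)
        for base in pfm:
            ppm[base].append((pfm[base][i] + pseudocount) / total)
    return ppm


def ppm_to_log_odds(ppm, background=0.25):
    """Convert probabilities to log2 odds against a uniform background."""
    return {
        base: [math.log2(p / background) for p in column]
        for base, column in ppm.items()
    }


ppm = pfm_to_ppm(PFM)
column_sums = [sum(ppm[base][i] for base in "ACGT") for i in range(len(ppm["A"]))]

# A pseudocount avoids log2(0) for nucleotides never observed at a position.
pwm = ppm_to_log_odds(pfm_to_ppm(PFM, pseudocount=1.0))
print(round(pwm["C"][4], 2), round(pwm["G"][4], 2))
```

At the fifth position C dominates the counts and G is never observed, so the C weight comes out positive and the G weight negative.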
Our framework should provide an easy way to compare ChIP-seq peaks with PWMs of interest. From there, we want to know information including:
- Number of peaks matched per PWM
- p-value associated with each PWM match
- A summary statistic of the p-values (a q-value, for example)
- Sequence covered by each match
- The orientation (strand) for each sequence
- More statistics that could be of interest. We can compare different runs based on the ChIP-seq metadata. For example: How does the binding sequence differ from one tissue to another? Is there an effect on the p-value caused by the orientation of the sequence? Does the relative position on the peak affect the score of the match?
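To make the matching step concrete, here is a minimal sketch of scanning one sequence on both strands and reporting the position, strand, score and covered sequence of each window. The 3-column weight matrix is a toy example invented for illustration, and `scan` is a hypothetical name. The sketch only computes raw scores; turning a score into a p-value would additionally need a background model, which is not shown here.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def score_window(pwm, window):
    """Sum the weight of each base at its position in the motif."""
    return sum(pwm[base][i] for i, base in enumerate(window))


def scan(pwm, sequence):
    """Yield (start, strand, score, matched_sequence) for every window."""
    width = len(pwm["A"])
    sequence = sequence.upper()
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if set(window) - set("ACGT"):
            continue  # skip windows containing N or other ambiguity codes
        yield start, "+", score_window(pwm, window), window
        # The minus-strand match is scored on the reverse complement.
        rc = reverse_complement(window)
        yield start, "-", score_window(pwm, rc), rc


# Toy log-odds style matrix that favors the motif "GAT" (made up for this example).
toy_pwm = {
    "A": [-1.0, 2.0, -1.0],
    "C": [-1.0, -1.0, -1.0],
    "G": [2.0, -1.0, -1.0],
    "T": [-1.0, -1.0, 2.0],
}

hits = list(scan(toy_pwm, "CCATCCC"))
best = max(hits, key=lambda hit: hit[2])
print(best)
```

Here the best hit lies on the minus strand: the window `ATC` at offset 2 reads `GAT` on the opposite strand. Keeping the offset within the peak for every hit is what would later allow questions about relative position and orientation to be answered.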
This output can be formatted as a text summary, but ideally we would also want to visualize the results in an HTML (possibly interactive) report. This can include:
- Visualization of the binding sequence over the peak
- Summary statistics regarding the position of the binding site
- Tabulation of p-values and q-values per experiment (if we are comparing multiple runs)
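The text does not say how the q-values would be computed. One common choice is the Benjamini-Hochberg adjustment of the p-values, sketched below as an assumption rather than the project's chosen method. The motif labels and p-values are made up for the example.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical matches: (label, p-value).
matches = [("motif_a", 0.01), ("motif_b", 0.04), ("motif_c", 0.03), ("motif_d", 0.5)]
q_values = benjamini_hochberg([p for _, p in matches])

print("motif\tp-value\tq-value")
for (label, p), q in zip(matches, q_values):
    print(f"{label}\t{p:.3g}\t{q:.3g}")
```

The same rows could feed either the text summary or a table in the HTML report.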
The project can be divided into 3 main components that can be worked on in parallel. Of course, we can change anything we want; this is merely a proposed design to help us settle on the ideas:
- The Initializer: The user inputs the dataset and motif to scan (Draft design: https://drive.google.com/file/d/0B2SxyuYUNY4aU0c0ZlRFZHJXNkk/view?usp=sharing)
- The Experimenter: The component used to compute matches and experiment statistics (Draft design: https://drive.google.com/file/d/0B2SxyuYUNY4aeVBLc2x5RnJPWWs/view?usp=sharing)
- The Visualizer: A generated summary of the experiment with an HTML report (Draft design: https://drive.google.com/file/d/0B2SxyuYUNY4aNnJHMG9WS3RzTjA/view?usp=sharing)
The design charts are simply to help us visualize the design and how the different pieces work together. Again, I expect we're going to modify this, and there are probably flaws in the current draft. It is just a draft :) It would be nice to revisit this and update the chart with our final design once we're done.
The Experimenter is more or less the "back-end" of the framework. The Initializer and the Visualizer are, respectively, the input and the output of the project. Either of these could have a graphical interface, depending on how much interest there is in building front-end components. It would also be fine if the Initializer is a command-line interface.
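If the Initializer ends up as a command-line interface, it could look roughly like the sketch below. The program name and every flag (`--peaks`, `--motif`, `--report`) are hypothetical placeholders, not an agreed interface.

```python
import argparse


def build_parser():
    """Build a command-line parser for a hypothetical Initializer."""
    parser = argparse.ArgumentParser(
        prog="initializer",  # hypothetical program name
        description="Collect the inputs for a motif scan over ChIP-seq peaks.",
    )
    parser.add_argument("--peaks", required=True,
                        help="path to a .bed file of ChIP-seq peaks")
    parser.add_argument("--motif", required=True, action="append",
                        help="motif identifier or matrix file; may be repeated")
    parser.add_argument("--report", default="report.html",
                        help="where the Visualizer should write its HTML report")
    return parser


# An explicit argument list is passed here so the example runs anywhere;
# a real script would call parse_args() with no arguments to read sys.argv.
args = build_parser().parse_args(["--peaks", "peaks.bed", "--motif", "MA0139.1"])
print(args.peaks, args.motif, args.report)
```

Making `--motif` repeatable would let a single run compare several PWMs against the same set of peaks.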
The issues can be used to create and break down tasks: https://github.com/hackseq/2016_project_5/issues
The wiki (here) is where we can store information about the tools, file formats and databases we'll be using.