Manuel Belmadani edited this page Sep 17, 2016 · 6 revisions

Welcome to the 2016_project_5 wiki!

Project name: "A framework to evaluate profiles from DNA-binding site collections represented in peak sequences from ChIP-Seq assays."

Team members: Remember to fill in the survey of interests! https://goo.gl/forms/lp3GKNxwpe1Oxd2I2

Questions

We're looking for some biological questions we can answer using our tools. Current ideas include:

- How do transcription factor binding genes differ across tissues, cell types and states?
- Comparing transcription factor binding sites with histone marks and methylation signatures.

Introduction

Our challenge here is to present a software framework that lets users evaluate how well a given motif (pattern) is represented in a set of DNA sequences. These sequences are "peaks" from a ChIP-seq assay. In general, a peak marks the approximate location (site) where a transcription factor (TF), which is a type of protein, binds to regulate transcription of a gene. These locations are known as Transcription Factor Binding Sites.

On one side, we have repositories of ChIP-seq assays. For each assay, you can expect to find:

  1. The peaks (encoded as a list of tab-separated values in a .bed file). Each line of the .bed file gives a chromosome, a start location and an end location for the peak, optionally followed by further columns.
  2. The target of the experiment, representing the TF of interest.
  3. Additional metadata, such as the tissue/cell line used in the assay, the organism, the experiment authors etc.

Bed file example

#CHROMOSOME START STOP NAME SCORE STRAND
chr1 164404 173864 ENST00000466557.1 0 -
chr1 235855 267253 ENST00000424587.1 0 -
chr1 317810 328455 ENST00000426316.1 0 +

See more at: https://genome.ucsc.edu/FAQ/FAQformat.html#format1
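
As a rough illustration of how the framework might read such a file, here is a minimal Python sketch that parses the first six BED columns. The `Peak` type and the `parse_bed` function are hypothetical names, not part of any existing tool, and the sketch assumes integer scores as in the example above.

```python
from typing import List, NamedTuple


class Peak(NamedTuple):
    """One BED record (hypothetical container used only in this sketch)."""
    chrom: str
    start: int   # 0-based start, as defined by the BED format
    end: int     # end position, not included in the interval
    name: str
    score: int
    strand: str


def parse_bed(text: str) -> List[Peak]:
    """Parse BED text, skipping blank lines, comments and track/browser lines."""
    peaks = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        # split() accepts tabs (real BED files) and spaces (as pasted on this page).
        fields = line.split()
        name = fields[3] if len(fields) > 3 else "."
        score = int(fields[4]) if len(fields) > 4 else 0
        strand = fields[5] if len(fields) > 5 else "."
        peaks.append(Peak(fields[0], int(fields[1]), int(fields[2]), name, score, strand))
    return peaks


example = """#CHROMOSOME START STOP NAME SCORE STRAND
chr1 164404 173864 ENST00000466557.1 0 -
chr1 235855 267253 ENST00000424587.1 0 -
chr1 317810 328455 ENST00000426316.1 0 +
"""

peaks = parse_bed(example)
for peak in peaks:
    print(peak.chrom, peak.start, peak.end, peak.strand, peak.end - peak.start)
```

Only the first three columns are mandatory in BED, which is why the sketch falls back to placeholder values when the later columns are missing.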

There are also motif databases that we can use to scan the sequences contained by the bed file entries. These motifs are represented by a position-weight matrix (PWM), which is a matrix of nucleotides and weighted probabilities that a given nucleotide occurs at a given position. Since DNA has an alphabet of 4 (A,C,G,T), a PWM is a matrix of size 4 x N, where N is the motif length.

PWM Example

Position frequency matrix (PFM):

A [ 87 167 281 56 8 744 40 107 851 5 333 54 12 56 104 372 82 117 402 ]
C [291 145 49 800 903 13 528 433 11 0 3 12 0 8 733 13 482 322 181 ]
G [ 76 414 449 21 0 65 334 48 32 903 566 504 890 775 5 507 307 73 266 ]
T [459 187 134 36 2 91 11 324 18 3 9 341 8 71 67 17 37 396 59 ]

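The PWM used here is the PFM normalized so that every column sums to 1.0. Many tools additionally convert these probabilities to log-odds scores relative to a background distribution, and those scores can be negative.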
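
The conversion can be sketched in a few lines of Python using the PFM shown above. The function names are hypothetical. The log-odds step, the pseudocount of 1 and the uniform background of 0.25 are assumptions added for illustration, not choices this page specifies.

```python
import math

BASES = "ACGT"

# PFM copied from the example above (19 positions).
PFM = {
    "A": [87, 167, 281, 56, 8, 744, 40, 107, 851, 5, 333, 54, 12, 56, 104, 372, 82, 117, 402],
    "C": [291, 145, 49, 800, 903, 13, 528, 433, 11, 0, 3, 12, 0, 8, 733, 13, 482, 322, 181],
    "G": [76, 414, 449, 21, 0, 65, 334, 48, 32, 903, 566, 504, 890, 775, 5, 507, 307, 73, 266],
    "T": [459, 187, 134, 36, 2, 91, 11, 324, 18, 3, 9, 341, 8, 71, 67, 17, 37, 396, 59],
}


def normalize(pfm, pseudocount=0.0):
    """Divide each count by its column total so every column sums to 1.0."""
    length = len(pfm["A"])
    result = {base: [] for base in BASES}
    for i in range(length):
        total = sum(pfm[base][i] + pseudocount for base in BASES)
        for base in BASES:
            result[base].append((pfm[base][i] + pseudocount) / total)
    return result


def to_log_odds(probabilities, background=0.25):
    """Convert probabilities to log2 odds against a uniform background (assumption)."""
    return {
        base: [math.log2(p / background) for p in probabilities[base]]
        for base in BASES
    }


pwm = normalize(PFM)
# A pseudocount avoids log2(0) for cells with zero counts, such as C at position 10.
log_odds = to_log_odds(normalize(PFM, pseudocount=1.0))

column_sums = [sum(pwm[base][i] for base in BASES) for i in range(len(pwm["A"]))]
print(column_sums[:3])
print(round(pwm["A"][0], 4), round(log_odds["C"][9], 3))
```

Without a pseudocount the normalized matrix is fine on its own, but the log-odds step would fail on zero counts, which is why the two calls differ.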

A position weight matrix (PWM) logo

Source: http://jaspar.genereg.net/cgi-bin/jaspar_db.pl?ID=MA0139.1&rm=present&collection=CORE

Our framework should provide an easy way to compare ChIP-seq peaks with PWMs of interest. From there, we want to know information including:

  1. Number of peaks matched per PWM
  2. p-value associated with each PWM match
  3. A multiple-testing-aware summary of the p-values (q-values, for example)
  4. Sequence covered by each match
  5. The orientation (strand) for each sequence
  6. More statistics that could be of interest. We can compare different runs based on the ChIP-seq metadata. For example: How does the binding sequence differ from one tissue to another? Is there an effect on the p-value caused by the orientation of the sequence? Does the relative position on the peak affect the score of the match?
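
To make items 4 and 5 concrete, the following Python sketch scores every window of a peak sequence against a motif on both strands and reports the position, strand and covered sequence of each hit. It is only an illustration under stated assumptions: the toy 4-column PFM, the pseudocount, the uniform background, the score threshold and all function names are hypothetical. Turning scores into p-values requires a background model and is not shown.

```python
import math

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def log_odds_columns(pfm, pseudocount=1.0, background=0.25):
    """Return one {base: log2-odds score} dict per motif position."""
    columns = []
    for i in range(len(pfm["A"])):
        total = sum(pfm[base][i] for base in BASES) + 4 * pseudocount
        columns.append({
            base: math.log2((pfm[base][i] + pseudocount) / total / background)
            for base in BASES
        })
    return columns


def scan(seq, columns, threshold):
    """Return (start, strand, plus-strand window, score) for windows scoring >= threshold."""
    seq = seq.upper()
    width = len(columns)
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        if set(window) - set(BASES):
            continue  # skip windows containing N or other ambiguity codes
        # Score the window as read on the plus strand, then as read on the minus strand.
        for strand, site in (("+", window), ("-", reverse_complement(window))):
            score = sum(column[base] for column, base in zip(columns, site))
            if score >= threshold:
                hits.append((start, strand, window, round(score, 3)))
    return hits


# Toy motif that strongly prefers GATA (made-up counts, for illustration only).
toy_pfm = {
    "A": [0, 10, 0, 10],
    "C": [0, 0, 0, 0],
    "G": [10, 0, 0, 0],
    "T": [0, 0, 10, 0],
}

hits = scan("CCGATACCTATCGG", log_odds_columns(toy_pfm), threshold=6.0)
for hit in hits:
    print(hit)
```

Start positions are relative to the peak sequence, so adding the peak's BED start would give genomic coordinates. The reported window is always the plus-strand sequence, even for minus-strand hits.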

This output can be formatted as a text summary, but ideally we would also want to visualize the results in an HTML (possibly interactive) report. This can include:

  1. Visualization of the binding sequence over the peak
  2. Summary statistics regarding the position of the binding site
  3. Tabulation of p-values, q-value per experiment (If we are comparing multiple runs.)
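
As one possible shape for the tabulation in item 3, the Python sketch below adjusts a list of p-values with the Benjamini-Hochberg procedure and renders them as a bare HTML table. The choice of Benjamini-Hochberg, the run names, the p-values and the function names are all assumptions made for illustration. The page does not say how q-values should be computed.

```python
from html import escape


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values never increase
    # as the rank decreases.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        qvalues[index] = running_min
    return qvalues


def html_table(rows):
    """Render (name, p-value, q-value) rows as a minimal HTML table."""
    header = "<tr><th>Experiment</th><th>p-value</th><th>q-value</th></tr>"
    body = "".join(
        f"<tr><td>{escape(name)}</td><td>{p:.3g}</td><td>{q:.3g}</td></tr>"
        for name, p, q in rows
    )
    return f"<table>{header}{body}</table>"


# Made-up runs and p-values, for illustration only.
names = ["run_A", "run_B", "run_C", "run_D"]
pvals = [0.01, 0.04, 0.03, 0.20]
qvals = benjamini_hochberg(pvals)
report = html_table(zip(names, pvals, qvals))
print(report)
```

A real report would likely use a templating library and add sorting or plots. The point here is only that the Visualizer can be driven by plain rows produced by the Experimenter.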

The project can be divided into 3 main components that can be worked on in parallel. Of course, we can change anything we want; this is merely a proposed design to help us settle on the ideas:

  1. The Initializer: The user inputs the dataset and motif to scan (Draft design: https://drive.google.com/file/d/0B2SxyuYUNY4aU0c0ZlRFZHJXNkk/view?usp=sharing)
  2. The Experimenter: The component used to compute matches and experiment statistics (Draft design: https://drive.google.com/file/d/0B2SxyuYUNY4aeVBLc2x5RnJPWWs/view?usp=sharing)
  3. The Visualizer: A generated summary of the experiment with an HTML report (Draft design: https://drive.google.com/file/d/0B2SxyuYUNY4aNnJHMG9WS3RzTjA/view?usp=sharing)

The design charts are simply to help us visualize the design and how the different pieces work together. Again, I expect we're going to modify this, and there are probably flaws in the current draft. It is just a draft :) It would be nice to revisit this and update the chart with our final design once we're done.

The Experimenter is somewhat the "back-end" of the framework. The Initializer and Visualizer are the input and output respectively of the project. It's possible to have a graphical interface for either of these, depending on how much interest there is in building front-end components. It would also be fine if the initializer is a command-line interface.
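
If the Initializer ends up as a command-line interface, it could start out as small as the following Python sketch built on the standard `argparse` module. The program name, the flags (`--peaks`, `--motif`, `--out`) and the default report filename are hypothetical and only meant to seed discussion.

```python
import argparse


def build_parser():
    """Build a hypothetical command-line parser for the Initializer."""
    parser = argparse.ArgumentParser(
        prog="initializer",
        description="Select the ChIP-seq peaks and the motif to scan.",
    )
    parser.add_argument("--peaks", required=True,
                        help="path to a .bed file of ChIP-seq peaks")
    parser.add_argument("--motif", required=True,
                        help="motif identifier or path to a matrix file")
    parser.add_argument("--out", default="report.html",
                        help="where the Visualizer should write its report")
    return parser


# Parse an explicit argument list here so the sketch runs without a real command line.
args = build_parser().parse_args(["--peaks", "peaks.bed", "--motif", "MA0139.1"])
print(args.peaks, args.motif, args.out)
```

The parsed arguments could then be handed to the Experimenter as a plain object, which keeps the door open for a graphical front end that produces the same fields.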

The issues can be used to create and break down tasks: https://github.com/hackseq/2016_project_5/issues
The wiki (here) is where we can store information about the tools, file formats and databases we'll be using.