Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ability to simulate simple 3D structure in Hi-C/3C reads #39

Open
1 of 3 tasks
koadman opened this issue Aug 26, 2016 · 17 comments
Open
1 of 3 tasks

Ability to simulate simple 3D structure in Hi-C/3C reads #39

koadman opened this issue Aug 26, 2016 · 17 comments

Comments

@koadman
Copy link
Collaborator

koadman commented Aug 26, 2016

We have discussed various ways of doing this. Some options we covered were:

  • drawing read pairs with frequencies given by a user-provided contact map
  • drawing read pairs from a parameterized function which specifies, e.g. circularity or some other chromosome conformation

Challenges involved were/are:

  • simulating sequencing error profiles with read pairs at arbitrary distance, which we thought could be addressed by using artsim or some other tool to simulate an excess of coverage and then drawing out read pairs at appropriate distances (that weren't originally simulated as a pair)
  • A user-provided contact map may not have the same genomic range as the evolved genomes. e.g. if a user provides a 5000x5000 contact map in 1kbp bins, this would have to be scaled to the size of each simulated genome.
  • It is not clear what other parametric functions might be appropriate, other than the one for circularity which specifies additional contact between chromosome arms.
@cerebis
Copy link
Owner

cerebis commented Sep 15, 2016

First port of call, the easiest structural detail to approximate is the single primary fold of a circular chromosome, leading to a second anti-diagonal intensity profile in a heatmap. I would suggest this can be represented as a reflection of the primary distribution with a lower intensity.

@cerebis
Copy link
Owner

cerebis commented Sep 15, 2016

Are we treating 3D structure in a secondary or tertiary sense? That is: low observational frequency driven by inaccessible regions due to tight folding, or lets say increased obs freq because of physical proximity due to consistent 3D structural elements (folds)?

@koadman
Copy link
Collaborator Author

koadman commented Sep 15, 2016

First port of call sounds great.
Wouldn't 2ndary & tertiary become mixed in the observed Hi-C read pair frequencies? I think it's probably enough to consider just the tertiary at first.

@cerebis
Copy link
Owner

cerebis commented Sep 18, 2016

Prototyping generators.

Initially, I made the mistake of thinking that the anti-diagonal was product of interacting flattened circular chromosomes. Visually, pull a rubber band tight, and opposing strands come into close proximity.

In pursuing that, I cobbled together a multi-modal variation of our empirical distribution, where the second smaller maxima was at the halfway point and followed the same geometrical relationship we've imposed. Well, that turns out to make parallel lines of density in the heatmap (parallel to the primary y=x), with a simple phase shifts of +L/2, -L/2. Sorta, duh! Clearly that won't creating orthogonal traces in the density, where a rotation is necessary, a simple transformation (y=L-y).

So, along comes the correct treatment. If you combine this with the above (because well, I had it coded) you get something that looks like this. There is no need to keep the multi-modal distribution, but its a nice patch-work quilt.

figure_1

or with the regular distribution

figure_1

One thing that is bothering me is the apparent aliasing within these heatmaps. I am hopeful that this is an artifact on the plotter.

@koadman
Copy link
Collaborator Author

koadman commented Sep 19, 2016

Nice one, looks warm & cozy.
looking at this i notice now that in some of the real datasets we've looked at there are bigger blobs of heat in the top right & bottom left than are showing up in your simulated data. i guess that suggests some additional contact in the origin region in the population, beyond what can be explained by the flattened circle model.

@cerebis
Copy link
Owner

cerebis commented Sep 19, 2016

Possible confirmation that streaking is related to Matplotlib. The same matrix as above in R heatmap.

rfig

@cerebis
Copy link
Owner

cerebis commented Sep 19, 2016

Now fully derived from simulated reads => BWA => contact map. This was from a community of 8 genomes, derived in two clades of 5 and 3 taxa.

Doesn't look bad. Only 50k pairs, so intensity is weak. (Lowered CM resolution to gain signal.)

outd raw

@koadman
Copy link
Collaborator Author

koadman commented Sep 19, 2016

nice. so the off-diagonal blocks with heat are presumably the result of closely related strains that have had mismapped read pairs? i see a cluster of 3 there, and the other 5 must have one pair that are close as well...
anyhow, looks like it's working. would you agree?

@cerebis
Copy link
Owner

cerebis commented Sep 22, 2016

First prototype of random intervals representing CIDs. Each interval has its own empirical distribution, so can be refined ad nauseum. The probability of being CID related is too high here (starving the full range), but was aimed and seeing their result.

Starting to remind me of 1D mappings.

intervals

@cerebis
Copy link
Owner

cerebis commented Sep 23, 2016

Taking the terminology from Tung et al, I have implemented a model of replicon CID regions. These chromosomal interaction domains are what attributed in producing the block density in the heatmap. Stay tuned for a pic via read-mapping.

@cerebis
Copy link
Owner

cerebis commented Sep 23, 2016

cid_1gen raw

My python script only does the upper triangle. The CIDs are randomly generated and in this case turned out a little sparse. (single 3Mbp genome, 250k HiC pairs)

@cerebis
Copy link
Owner

cerebis commented Sep 23, 2016

cid_1gen raw

new seed, 3Mbp genome, 1M HiC pairs

@cerebis
Copy link
Owner

cerebis commented Sep 23, 2016

comparison

Critique against the figure in the paper

  • The detail is hard to see due to insufficient signal at the given resolution (10 sites per bin).
  • Some smaller sites are possibly too small and the minimum could be raised.
  • I need to confirm that the depletion of the background (whole chromosome) distribution is happening along both axes. This may only be happening along the x-axis, suggesting a small oversight in the generation of coordinates.
  • real-data: are the CIDs visible on the anti-diagonal? Maybe?
  • real-data: are the upper and lower halves reflections?
  • intensity of CIDs is possibly too high or the distribution needs more fall-off over its extent. (Geometric shape parameter in this case)

@koadman
Copy link
Collaborator Author

koadman commented Sep 23, 2016

this looks really neat. i don't have a good sense in how much value there is in perfecting it beyond this point -- maybe need to see some use cases for guidance.

@cerebis
Copy link
Owner

cerebis commented Oct 4, 2016

31925f5

@cerebis
Copy link
Owner

cerebis commented Oct 25, 2016

I think I will close this issue for now.

@cerebis cerebis closed this as completed Oct 25, 2016
@cerebis
Copy link
Owner

cerebis commented Dec 22, 2016

A new method for generating CID regions has been implemented. Rather than random intervals, this approach assumes folding regions (interacting domains) are better modelled as a nested (hierarchical) set of intervals.

I am making a note here and posting an example contact map. This map is by generated coordinates only, not a full simulation with actual read-mapping.

In implementing this method, there is a new issue #70 to fine-tune it. This is regarding the sizes of intervals, the probabilities assigned to them.

This figure depicts E.coli chr1, fake chr2 and S.aures chr1. Therefore there are also inter-replicon contacts between chr1 and chr2 of E.coli.

nested

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants